Posted to commits@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2005/04/09 03:15:35 UTC
[Nutch Wiki] Update of "InjectOptions" by ChiragChaman
The following page has been changed by ChiragChaman:
http://wiki.apache.org/nutch/InjectOptions
New page:
= bin/nutch inject =
== called java class ==
net.nutch.db.WebDBInjector
== command line options ==
bin/nutch inject <db> (-urlfile <url_file> | -dmozfile <dmoz_file>) [-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] [-noDmozDesc]
== -urlfile <url_file> ==
Injects urls from a text file. Use a file with one url per line.
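As a rough sketch of the one-url-per-line format, the following writes a small seed file and builds the matching inject command. The file name urls.txt, the example urls, and the database directory db are assumptions made for illustration, not values defined on this page.

```shell
# Hypothetical seed file name and urls, for illustration only.
url_file="urls.txt"
printf '%s\n' \
  'http://www.example.com/' \
  'http://www.example.org/' \
  'http://www.example.net/docs/' > "$url_file"

# "db" is an assumed web database directory; substitute your own.
inject_cmd="bin/nutch inject db -urlfile $url_file"

# Run the injector only inside a Nutch installation; otherwise just show the command.
if [ -x bin/nutch ]; then
  $inject_cmd
else
  echo "would run: $inject_cmd"
fi
```

The guard around bin/nutch only keeps the sketch harmless when it is pasted outside a Nutch checkout.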
== -dmozfile <dmoz_file> ==
Injects the urls from a dmoz content file. You can download the current content file from dmoz.org.
== -subset <subsetDenominator> ==
Use this option to inject only one out of every <subsetDenominator> urls. Injecting and fetching all urls from the open directory means fetching over 4 million urls, so for testing you may want to start with fewer. For example, -subset 4000 injects one out of every 4000 urls, which would be around 1000 urls. A random subset is selected: repeated calls with the same value will inject different urls.
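The arithmetic behind that estimate can be checked with a short sketch. The total of 4000000 is an assumed round figure based on the "over 4 million" mentioned above, so the result is only an approximation; the db directory and the content.rdf.u8 file name in the printed command are likewise assumptions.

```shell
# Assumed total, rounded from the "over 4 million urls" figure.
total_urls=4000000
subset_denominator=4000

# Integer division gives the approximate number of urls injected.
expected=$((total_urls / subset_denominator))
echo "-subset $subset_denominator injects roughly $expected urls"

# The corresponding command line (assumed db directory and dmoz file name).
echo "bin/nutch inject db -dmozfile content.rdf.u8 -subset $subset_denominator"
```

Because the subset is chosen at random, the actual count will vary from run to run around this estimate.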
== -includeAdultMaterial ==
By default, urls from the adult part of the open directory are not included. Specify this option to include them.
== -skew skew ==
The seed for the randomization used by -subset. For debugging.
== -noDmozDesc ==
If specified, the Open Directory description is '''not''' used as a link to the page.
== config file options ==
== db.score.injected ==
The score of new pages added by the injector. 2.0 by default.
== db.default.fetch.interval ==
The number of days to wait after an injected page is fetched before it is fetched again. 30 by default.
-- MatthiasJaekle - 13 Mar 2004