Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2011/07/01 06:21:32 UTC
[Nutch Wiki] Update of "bin/nutch_freegen" by LewisJohnMcgibbney
Dear Wiki user,
You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "bin/nutch_freegen" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_freegen
Comment:
Update to reflect Nutch 1.3 API
New page:
freegen is an alias for org.apache.nutch.tools.FreeGenerator
This tool generates fetchlists (segments to be fetched) from plain text files containing one URL per line. It is useful for testing, or when arbitrary URLs need to be fetched without first adding them to the CrawlDb.
Usage:
{{{
bin/nutch freegen <inputDir> <segmentsDir> [-filter] [-normalize]
}}}
'''<inputDir>''': This should be the path to the input directory containing one or more input (text) files. As with the Injector class, each text file should contain a list of URLs, one URL per line.
'''<segmentsDir>''': The path to the desired output directory, where the new segment will be created.
'''[-filter]''': An argument to run the current URLFilters on input URLs to improve the quality of the new segment(s).
'''[-normalize]''': This argument should be passed to run URLNormalizers on input URLs before they are used to create new segments.
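As a rough illustration of the options above, the following sketch prepares an input directory and invokes the tool. The directory names (urls, crawl/segments), the file name seed.txt and the two example URLs are assumptions chosen for illustration only; the command is invoked through the freegen alias from this page's title, and the guard simply skips the call when bin/nutch is not present in the current directory.

```shell
#!/bin/sh
# Hypothetical locations; adjust to your own installation layout.
INPUT_DIR=urls
SEGMENTS_DIR=crawl/segments

# Create the input directory and a text file with one URL per line.
mkdir -p "$INPUT_DIR"
printf '%s\n' \
  'http://nutch.apache.org/' \
  'http://wiki.apache.org/nutch/' > "$INPUT_DIR/seed.txt"

# Generate a new segment, applying URL filters and normalizers.
# Only attempted when the Nutch launcher script is available here.
if [ -x bin/nutch ]; then
  bin/nutch freegen "$INPUT_DIR" "$SEGMENTS_DIR" -filter -normalize
else
  echo "bin/nutch not found; run this from the Nutch installation directory" >&2
fi
```

Both optional flags are passed here; either can be dropped when filtering or normalization is not wanted.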
CommandLineOptions