Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2013/04/27 23:19:25 UTC

[Nutch Wiki] Update of "bin/nutch generate" by TejasPatil

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/nutch generate" page has been changed by TejasPatil:
http://wiki.apache.org/nutch/bin/nutch%20generate?action=diff&rev1=1&rev2=2

Comment:
added the usage for generate in 2.x

  
  This class generates a subset of a crawl db to fetch. This version allows us to generate fetchlists for several segments in one go. Unlike in the initial version (FetchListTool), the IP resolution is done ONLY on the entries which have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. We can choose separately how to count the URLs, i.e. by domain or host, to limit the entries.
  
+ === Nutch 1.x ===
  {{{
  Usage: bin/nutch generate <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
  }}}
@@ -26, +27 @@

  
  '''[-maxNumSegments num]''':
  
- === Configuration Files ===
+ ==== Configuration Files ====
   hadoop-default.xml<<BR>>
   hadoop-site.xml<<BR>>
   nutch-default.xml<<BR>>
   nutch-site.xml<<BR>>
  
- === Configuration Values ===
+ ==== Configuration Values ====
   The following properties directly affect how the Generator generates fetch segments.<<BR>><<BR>>
   * generate.max.count: The maximum number of URLs in a single fetchlist, or -1 for unlimited. The URLs are counted according to the value of the parameter generate.count.mode.
   
   * generate.count.mode: Determines how the URLs are counted for generate.max.count. Default value is 'host' but can be 'domain'. Note that we do not count per IP in the new version of the Generator.
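
As a minimal sketch of how these two properties might be overridden, the fragment below could be placed inside the `<configuration>` element of nutch-site.xml. The values 100 and 'domain' are illustrative assumptions, not recommendations from the page:

```xml
<!-- Illustrative values only: cap each fetchlist at 100 URLs per domain. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
```

With this setting, the count limit would apply per domain rather than per host.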
    
- === Examples ===
+ ==== Examples ====
  
  {{{
  bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments
@@ -50, +51 @@

  }}}
   In this example the Generator will add 20 days to the current date/time when determining the top 100 scoring pages to fetch.
  
+ === Nutch 2.x ===
+ {{{
+ Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
+     -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE 
+     -crawlId <id>  - the id to prefix the schemas to operate on, 
+                      (default: storage.crawl.id)
+     -noFilter      - do not activate the filter plugin to filter the url, default is true 
+     -noNorm        - do not activate the normalizer plugin to normalize the url, default is true 
+     -adddays       - Adds numDays to the current time to facilitate crawling urls already
+                      fetched sooner than db.default.fetch.interval. Default value is 0.
+ }}}
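
A hedged sketch of a 2.x invocation, mirroring the class-name form used in the 1.x examples. It uses only options listed in the usage text above; the crawl id `mycrawl` and the numeric values are hypothetical, and the script merely prints the command unless a `bin/nutch` launcher is present in the current directory:

```shell
#!/bin/sh
# Hypothetical values chosen for illustration only.
TOP_N=100          # select at most the 100 top URLs
CRAWL_ID=mycrawl   # hypothetical crawl id used to prefix the schemas
ADD_DAYS=20        # add 20 days to the current time

# Assemble the command from the options shown in the GeneratorJob usage text.
CMD="bin/nutch org.apache.nutch.crawl.GeneratorJob -topN $TOP_N -crawlId $CRAWL_ID -adddays $ADD_DAYS"
echo "$CMD"

# Run it only when a Nutch launcher script exists in the current directory.
if [ -x bin/nutch ]; then
  $CMD
fi
```

Omitting `-crawlId` would fall back to the value of storage.crawl.id, as the usage text notes.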
  
  CommandLineOptions