Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2018/08/15 13:38:59 UTC

[Nutch Wiki] Update of "bin/crawl" by SebastianNagel

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/crawl" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/crawl?action=diff&rev1=2&rev2=3

Comment:
Update to recent version (1.15) of bin/crawl

  = Usage =
  == Nutch 1.X ==
  {{{
-      Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
+ Usage: crawl [options] <crawl_dir> <num_rounds>
+ 
+ Arguments:
+   <crawl_dir>                           Directory where the crawl/host/link/segments dirs are saved
+   <num_rounds>                          The number of rounds to run this crawl for
+ 
+ Options:
-         -i|--index      Indexes crawl results into a configured indexer
+   -i|--index                            Indexes crawl results into a configured indexer
-         -D              A Java property to pass to Nutch calls
+   -D                                    A Java property to pass to Nutch calls
-         Seed Dir        Directory in which to look for a seeds file
-         Crawl Dir       Directory where the crawl/link/segments dirs are saved
-         Num Rounds      The number of rounds to run this crawl for
-      Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/  2
+   -w|--wait <NUMBER[SUFFIX]>            Time to wait before generating a new segment when no URLs
+                                         are scheduled for fetching. Suffix can be: s for second,
+                                         m for minute, h for hour and d for day. If no suffix is
+                                         specified second is used by default. [default: -1]
+   -s <seed_dir>                         Path to seeds file(s)
+   -sm <sitemap_dir>                     Path to sitemap URL file(s)
+   --hostdbupdate                        Boolean flag showing if we either update or not update hostdb for each round
+   --hostdbgenerate                      Boolean flag showing if we use hostdb in generate or not
+   --num-slaves <num_slaves>             Number of slave nodes [default: 1]
+                                         Note: This can only be set when running in distribution mode
+   --num-tasks <num_tasks>               Number of reducer tasks [default: 2]
+   --size-fetchlist <size_fetchlist>     Number of URLs to fetch in one iteration [default: 50000]
+   --time-limit-fetch <time_limit_fetch> Number of minutes allocated to the fetching [default: 180]
+   --num-threads <num_threads>           Number of threads for fetching / sitemap processing [default: 50]
+   --sitemaps-from-hostdb <frequency>    Whether and how often to process sitemaps based on HostDB.
+                                         Supported values are:
+                                           - never [default]
+                                           - always (processing takes place in every iteration)
+                                           - once (processing only takes place in the first iteration)
  }}}
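
The `NUMBER[SUFFIX]` format accepted by `-w|--wait` can be illustrated with a small conversion routine. The following is only a sketch under stated assumptions: `wait_to_seconds` is a hypothetical helper written for this note, not code taken from bin/crawl, and it simply passes negative values such as the default `-1` through unchanged.

```shell
# Hypothetical helper (not part of bin/crawl): convert a NUMBER[SUFFIX]
# value, as it might be passed in e.g.
#   bin/crawl -w 5m -s urls/ TestCrawl/ 2
# into a number of seconds.
wait_to_seconds() {
  value=$1
  # Pick the multiplier from the suffix; no suffix means seconds.
  case "$value" in
    *s) number=${value%s}; factor=1 ;;
    *m) number=${value%m}; factor=60 ;;
    *h) number=${value%h}; factor=3600 ;;
    *d) number=${value%d}; factor=86400 ;;
    *)  number=$value;     factor=1 ;;
  esac
  # Accept an optional leading minus sign, then require plain digits.
  digits=${number#-}
  case "$digits" in
    ''|*[!0-9]*) echo "invalid wait value: $value" >&2; return 1 ;;
  esac
  printf '%s\n' "$((number * factor))"
}
```

With this sketch, `wait_to_seconds 5m` prints `300`, `wait_to_seconds 2h` prints `7200`, and `wait_to_seconds 30` prints `30`.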
  
  == Nutch 2.x ==