Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2018/08/15 13:38:59 UTC
[Nutch Wiki] Update of "bin/crawl" by SebastianNagel
The "bin/crawl" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/crawl?action=diff&rev1=2&rev2=3
Comment:
Update to recent version (1.15) of bin/crawl
= Usage =
== Nutch 1.X ==
{{{
- Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
+ Usage: crawl [options] <crawl_dir> <num_rounds>
+
+ Arguments:
+ <crawl_dir> Directory where the crawl/host/link/segments dirs are saved
+ <num_rounds> The number of rounds to run this crawl for
+
+ Options:
- -i|--index Indexes crawl results into a configured indexer
+ -i|--index Indexes crawl results into a configured indexer
- -D A Java property to pass to Nutch calls
+ -D A Java property to pass to Nutch calls
- Seed Dir Directory in which to look for a seeds file
- Crawl Dir Directory where the crawl/link/segments dirs are saved
- Num Rounds The number of rounds to run this crawl for
- Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
+ -w|--wait <NUMBER[SUFFIX]> Time to wait before generating a new segment when no URLs
+ are scheduled for fetching. Suffix can be: s for second,
+ m for minute, h for hour and d for day. If no suffix is
+ specified second is used by default. [default: -1]
+ -s <seed_dir> Path to seeds file(s)
+ -sm <sitemap_dir> Path to sitemap URL file(s)
+ --hostdbupdate Boolean flag showing if we either update or not update hostdb for each round
+ --hostdbgenerate Boolean flag showing if we use hostdb in generate or not
+ --num-slaves <num_slaves> Number of slave nodes [default: 1]
+ Note: This can only be set when running in distribution mode
+ --num-tasks <num_tasks> Number of reducer tasks [default: 2]
+ --size-fetchlist <size_fetchlist> Number of URLs to fetch in one iteration [default: 50000]
+ --time-limit-fetch <time_limit_fetch> Number of minutes allocated to the fetching [default: 180]
+ --num-threads <num_threads> Number of threads for fetching / sitemap processing [default: 50]
+ --sitemaps-from-hostdb <frequency> Whether and how often to process sitemaps based on HostDB.
+ Supported values are:
+ - never [default]
+ - always (processing takes place in every iteration)
+ - once (processing only takes place in the first iteration)
}}}
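With the updated syntax the seed directory is passed with -s instead of as a positional argument, so the removed example would presumably become something like `bin/crawl -i -s urls/ TestCrawl/ 2` (urls/ and TestCrawl/ are the placeholder paths from the old example). The suffix rules described for -w|--wait can be sketched as below. The helper name to_seconds is hypothetical and this is only an illustration of the documented rules, not the code used inside bin/crawl.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of bin/crawl): convert a -w/--wait value
# such as "30", "45s", "5m", "2h" or "1d" into seconds, following the
# suffix rules in the usage text: s, m, h, d; no suffix means seconds.
to_seconds() {
  local value="$1"
  local number="${value%[smhd]}"     # drop one trailing suffix letter, if present
  local suffix="${value#"$number"}"  # the letter that was dropped, or empty
  case "$suffix" in
    ""|s) echo "$number" ;;
    m)    echo $(( number * 60 )) ;;
    h)    echo $(( number * 3600 )) ;;
    d)    echo $(( number * 86400 )) ;;
  esac
}
```

For instance, `to_seconds 5m` yields 300, and a bare number such as 30 is passed through unchanged as seconds.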
== Nutch 2.x ==