Posted to commits@any23.apache.org by si...@apache.org on 2012/07/01 16:17:46 UTC
svn commit: r1355913 - /incubator/any23/trunk/src/site/apt/getting-started.apt
Author: simonetripodi
Date: Sun Jul 1 14:17:45 2012
New Revision: 1355913
URL: http://svn.apache.org/viewvc?rev=1355913&view=rev
Log:
updated the crawler plugin
Modified:
incubator/any23/trunk/src/site/apt/getting-started.apt
Modified: incubator/any23/trunk/src/site/apt/getting-started.apt
URL: http://svn.apache.org/viewvc/incubator/any23/trunk/src/site/apt/getting-started.apt?rev=1355913&r1=1355912&r2=1355913&view=diff
==============================================================================
--- incubator/any23/trunk/src/site/apt/getting-started.apt (original)
+++ incubator/any23/trunk/src/site/apt/getting-started.apt Sun Jul 1 14:17:45 2012
@@ -269,49 +269,52 @@ any23-core$ ./bin/any23 verify [/path/to
The <Crawler Plugin> provides basic site crawling and metadata extraction capabilities.
+----------------------------------------------------------------------------
-any23-core/bin$ ./any23tools Crawler
-usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
- [-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o
- <arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
- [-storagefolder <arg>] [-t] [-v]
- -d,--defaultns <arg> Override the default namespace used to produce
- statements.
- -e <arg> Specify a comma-separated list of extractors,
- e.g. rdf-xml,rdf-turtle.
- -f,--Output format <arg> [turtle (default), rdfxml, ntriples, nquads,
- trix, json, uri]
- -h,--help Print this help.
- -l,--log <arg> Produce log within a file.
- -maxdepth <arg> Max allowed crawler depth. Default: no limit.
- -maxpages <arg> Max number of pages before interrupting crawl.
- Default: no limit.
- -n,--nesting Disable production of nesting triples.
- -numcrawlers <arg> Sets the number of crawlers. Default: 10
- -o,--output <arg> Specify Output file (defaults to standard
- output).
- -p,--pedantic Validate and fixes HTML content detecting
- commons issues.
- -pagefilter <arg> Regex used to filter out page URLs during
- crawling. Default:
- '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
- mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
- il|pdf|swf|zip|rar|gz|xml|txt))$'
- -politenessdelay <arg> Politeness delay in milliseconds. Default: no
- limit.
- -s,--stats Print out extraction statistics.
- -storagefolder <arg> Folder used to store crawler temporary data.
- Default:
- [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
- q/T/]
- -t,--notrivial Filter trivial statements (e.g. CSS related
- ones).
- -v,--verbose Show debug and progress information.
+any23-core$ ./bin/any23 -h
+[...]
+ crawler Any23 Crawler Command Line Tool.
+ Usage: crawler [options] input URIs {<url>|<file>}+
+ Options:
+ -d, --defaultns Override the default namespace used to
+ produce statements.
+ -e, --extractors a comma-separated list of extractors, e.g.
+ rdf-xml,rdf-turtle
+ Default: []
+ -f, --format the output format
+ Default: turtle
+ -l, --log Produce log within a file.
+ -md, --maxdepth Max allowed crawler depth.
+ Default: 2147483647
+ -mp, --maxpages Max number of pages before interrupting
+ crawl.
+ Default: 2147483647
+ -n, --nesting Disable production of nesting triples.
+ Default: false
+ -t, --notrivial Filter trivial statements (e.g. CSS related
+ ones).
+ Default: false
+ -nc, --numcrawlers Sets the number of crawlers.
+ Default: 10
+ -o, --output Specify Output file (defaults to standard
+ output)
+ Default: java.io.PrintStream@2911a3a4
+ -pf, --pagefilter Regex used to filter out page URLs during
+ crawling.
+ Default: .*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$
+ -p, --pedantic Validate and fixes HTML content detecting
+ commons issues.
+ Default: false
+ -pd, --politenessdelay Politeness delay in milliseconds.
+ Default: 2147483647
+ -s, --stats Print out extraction statistics.
+ Default: false
+ -sf, --storagefolder Folder used to store crawler temporary data.
+ Default: /var/folders/zz/9vvv_lbn1cs8dpwz859nmq080000gn/T/crawler-metadata-9ff4c650-10c2-41a1-9d99-ebeb3e7d21ce
+----------------------------------------------------------------------------
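The default `-pf`/`--pagefilter` regex shown in the help output above can be exercised directly with `grep -E`; a quick sketch (the URLs are made up for illustration, and are not part of the committed documentation):

```shell
# The crawler's default page filter (copied from the -h output above);
# URLs matching this regex are skipped during the crawl.
FILTER='.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$'
# A stylesheet URL matches and would be filtered out (grep -c prints 1):
echo 'http://example.org/theme/style.css' | grep -cE "$FILTER"
# An HTML page does not match and would be crawled (grep -c prints 0):
echo 'http://example.org/index.html' | grep -cE "$FILTER" || true
```

Passing a different regex via `-pf` replaces this default entirely, so extensions you still want crawled must simply be left out of the pattern.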
A usage example:
+----------------------------------------------------------------------------
-any23-core/bin$ ./any23tools Crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
+any23-core$ ./bin/any23 crawler -s -f ntriples http://www.repubblica.it 1> out.nt 2> repubblica.log
+----------------------------------------------------------------------------
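Since the example above redirects the N-Triples stream to out.nt, the result can be sanity-checked with standard shell tools. A sketch using a stand-in file (the real out.nt depends on what the crawl extracts; these two statements are hypothetical):

```shell
# Stand-in for the crawler's out.nt: two hypothetical N-Triples statements.
cat > out.nt <<'EOF'
<http://www.repubblica.it/> <http://purl.org/dc/terms/title> "Repubblica" .
<http://www.repubblica.it/> <http://purl.org/dc/terms/language> "it" .
EOF
# Number of extracted triples (N-Triples puts one statement per line):
wc -l < out.nt
# Distinct predicates, a rough view of which vocabularies were extracted:
cut -d' ' -f2 out.nt | sort -u
```

The same pipeline works on any N-Triples output from the `-f ntriples` option, regardless of the crawled site.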
* Use <<Apache Any23>> as a RESTful Web Service