Posted to commits@any23.apache.org by mo...@apache.org on 2012/03/25 16:16:51 UTC
svn commit: r1305044 - /incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt
Author: mostarda
Date: Sun Mar 25 14:16:51 2012
New Revision: 1305044
URL: http://svn.apache.org/viewvc?rev=1305044&view=rev
Log:
Added Basic Crawler documentation. Related to issue #ANY23-41.
Added:
incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt
Added: incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt
URL: http://svn.apache.org/viewvc/incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt?rev=1305044&view=auto
==============================================================================
--- incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt (added)
+++ incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt Sun Mar 25 14:16:51 2012
@@ -0,0 +1,71 @@
+ ------
+ Apache Any23 - Plugins - Basic Crawler
+ ------
+ The Apache Software Foundation
+ ------
+ 2011-2012
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Basic Crawler Plugin
+
+ The <Basic Crawler Plugin> implements a <CLI> {{{./xref/org/apache/any23/cli/Tool.html}Tool}} extending
+ {{{./xref/org/apache/any23/cli/Rover.html}Rover}} to add <site crawling> capabilities.
+
+ The tool can be used to extract semantic content from small and medium-sized sites.
+
+ To use it, make sure the basic-crawler plugin is configured correctly so that the
+ <any23tools> script can find it (follow the instructions in the {{{./any23-plugins.html}Plugins}} section):
+
++--------------------------------------------------------------
+core/bin/$ ./any23tools Crawler
+usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
+ [-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o
+ <arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
+ [-storagefolder <arg>] [-t] [-v]
+ -d,--defaultns <arg> Override the default namespace used to produce
+ statements.
+ -e <arg> Specify a comma-separated list of extractors,
+ e.g. rdf-xml,rdf-turtle.
+ -f,--Output format <arg> [turtle (default), rdfxml, ntriples, nquads,
+ trix, json, uri]
+ -h,--help Print this help.
+ -l,--log <arg> Produce log within a file.
+ -maxdepth <arg> Max allowed crawler depth. Default: no limit.
+ -maxpages <arg> Max number of pages before interrupting crawl.
+ Default: no limit.
+ -n,--nesting Disable production of nesting triples.
+ -numcrawlers <arg> Sets the number of crawlers. Default: 10
+ -o,--output <arg> Specify Output file (defaults to standard
+ output).
+ -p,--pedantic Validate and fixes HTML content detecting
+ commons issues.
+ -pagefilter <arg> Regex used to filter out page URLs during
+ crawling. Default:
+ '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
+ mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
+ il|pdf|swf|zip|rar|gz|xml|txt))$'
+ -politenessdelay <arg> Politeness delay in milliseconds. Default: no
+ limit.
+ -s,--stats Print out extraction statistics.
+ -storagefolder <arg> Folder used to store crawler temporary data.
+ Default:
+ [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
+ q/T/]
+ -t,--notrivial Filter trivial statements (e.g. CSS related
+ ones).
+ -v,--verbose Show debug and progress information.
++--------------------------------------------------------------
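The default <<<-pagefilter>>> value in the usage output above is wrapped across three lines by the help formatter, which makes it hard to read. The following is a minimal Python sketch, not part of the commit, that joins the pattern back together and checks a few URLs against it. It assumes that a URL is filtered out when the pattern matches the whole URL; the help text does not state the exact matching semantics, and the function name and sample URLs are hypothetical.

```python
import re

# Default -pagefilter pattern, copied from the usage output above with the
# help-text line wraps joined back together.
DEFAULT_PAGE_FILTER = re.compile(
    r".*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|wma|avi|mov"
    r"|mpeg|ram|m4v|wmv|rm|smil|pdf|swf|zip|rar|gz|xml|txt))$"
)


def is_filtered_out(url: str) -> bool:
    """Return True if the pattern matches the whole URL.

    Whole-string matching is an assumption made for this sketch; the
    crawler's actual matching call is not shown in the help text.
    """
    return DEFAULT_PAGE_FILTER.fullmatch(url) is not None


if __name__ == "__main__":
    # Hypothetical sample URLs, used only for illustration.
    samples = [
        "http://example.org/index.html",
        "http://example.org/style.css",
        "http://example.org/photo.jpeg",
        "http://example.org/data.xml",
        "http://example.org/about",
    ]
    for url in samples:
        status = "skipped" if is_filtered_out(url) else "crawled"
        print(f"{status:8s} {url}")
```

Under that assumption, URLs ending in <<<.xml>>> or <<<.txt>>> match the default pattern, so a custom <<<-pagefilter>>> value may be worth considering when a site publishes data under those extensions.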