You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@any23.apache.org by mo...@apache.org on 2012/03/25 16:16:51 UTC

svn commit: r1305044 - /incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt

Author: mostarda
Date: Sun Mar 25 14:16:51 2012
New Revision: 1305044

URL: http://svn.apache.org/viewvc?rev=1305044&view=rev
Log:
Added Basic Crawler documentation. Related to issue #ANY23-41.

Added:
    incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt

Added: incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt
URL: http://svn.apache.org/viewvc/incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt?rev=1305044&view=auto
==============================================================================
--- incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt (added)
+++ incubator/any23/trunk/src/site/apt/plugin-basic-crawler.apt Sun Mar 25 14:16:51 2012
@@ -0,0 +1,71 @@
+                                    ------
+                                    Apache Any23 - Plugins - Basic Crawler
+                                    ------
+                              The Apache Software Foundation
+                                    ------
+                                     2011-2012
+
+~~  Licensed to the Apache Software Foundation (ASF) under one or more
+~~  contributor license agreements.  See the NOTICE file distributed with
+~~  this work for additional information regarding copyright ownership.
+~~  The ASF licenses this file to You under the Apache License, Version 2.0
+~~  (the "License"); you may not use this file except in compliance with
+~~  the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~  Unless required by applicable law or agreed to in writing, software
+~~  distributed under the License is distributed on an "AS IS" BASIS,
+~~  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~  See the License for the specific language governing permissions and
+~~  limitations under the License.
+
+Basic Crawler Plugin
+
+  The <Basic Crawler Plugin> implements a <CLI> {{{./xref/org/apache/any23/cli/Tool.html}Tool}} extending
+  {{{./xref/org/apache/any23/cli/Rover.html}Rover}} to add <site crawling> capabilities.
+
+  The tool can be used to extract semantic content from a small/medium size sites.
+
+  To use it make sure to have correctly configured the basic-crawler plugin to be found by the
+  <any23tools> script (follow the {{{./any23-plugins.html}Plugins}} section instructions):
+
++--------------------------------------------------------------
+core/bin/$ ./any23tools Crawler
+usage: [{<url>|<file>}]+ [-d <arg>] [-e <arg>] [-f <arg>] [-h] [-l <arg>]
+       [-maxdepth <arg>] [-maxpages <arg>] [-n] [-numcrawlers <arg>] [-o
+       <arg>] [-p] [-pagefilter <arg>] [-politenessdelay <arg>] [-s]
+       [-storagefolder <arg>] [-t] [-v]
+ -d,--defaultns <arg>       Override the default namespace used to produce
+                            statements.
+ -e <arg>                   Specify a comma-separated list of extractors,
+                            e.g. rdf-xml,rdf-turtle.
+ -f,--Output format <arg>   [turtle (default), rdfxml, ntriples, nquads,
+                            trix, json, uri]
+ -h,--help                  Print this help.
+ -l,--log <arg>             Produce log within a file.
+ -maxdepth <arg>            Max allowed crawler depth. Default: no limit.
+ -maxpages <arg>            Max number of pages before interrupting crawl.
+                            Default: no limit.
+ -n,--nesting               Disable production of nesting triples.
+ -numcrawlers <arg>         Sets the number of crawlers. Default: 10
+ -o,--output <arg>          Specify Output file (defaults to standard
+                            output).
+ -p,--pedantic              Validate and fixes HTML content detecting
+                            commons issues.
+ -pagefilter <arg>          Regex used to filter out page URLs during
+                            crawling. Default:
+                            '.*(\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|
+                            mp3|mp4|wav|wma|avi|mov|mpeg|ram|m4v|wmv|rm|sm
+                            il|pdf|swf|zip|rar|gz|xml|txt))$'
+ -politenessdelay <arg>     Politeness delay in milliseconds. Default: no
+                            limit.
+ -s,--stats                 Print out extraction statistics.
+ -storagefolder <arg>       Folder used to store crawler temporary data.
+                            Default:
+                            [/var/folders/d5/c_0b4h1d7t1gx6tzz_dn5cj40000g
+                            q/T/]
+ -t,--notrivial             Filter trivial statements (e.g. CSS related
+                            ones).
+ -v,--verbose               Show debug and progress information.
++--------------------------------------------------------------