You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tika.apache.org by ju...@apache.org on 2008/11/27 02:33:41 UTC
svn commit: r721064 - in /lucene/tika/trunk/src/site: apt/gettingstarted.apt
apt/index.apt site.xml
Author: jukka
Date: Wed Nov 26 17:33:41 2008
New Revision: 721064
URL: http://svn.apache.org/viewvc?rev=721064&view=rev
Log:
TIKA-176: Getting Started guide
First version of the Getting Started guide. Feel free to improve!
Added:
lucene/tika/trunk/src/site/apt/gettingstarted.apt
Modified:
lucene/tika/trunk/src/site/apt/index.apt
lucene/tika/trunk/src/site/site.xml
Added: lucene/tika/trunk/src/site/apt/gettingstarted.apt
URL: http://svn.apache.org/viewvc/lucene/tika/trunk/src/site/apt/gettingstarted.apt?rev=721064&view=auto
==============================================================================
--- lucene/tika/trunk/src/site/apt/gettingstarted.apt (added)
+++ lucene/tika/trunk/src/site/apt/gettingstarted.apt Wed Nov 26 17:33:41 2008
@@ -0,0 +1,232 @@
+ --------------------------------
+ Getting Started with Apache Tika
+ --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{download.html}download}} a source release or
+ {{{source-repository.html}checkout}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the source directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 5 or higher to build Tika.
+
+Build artifacts
+
+ The Tika build produces the following libraries in the <<<target>>>
+ directory (x.y stands for the current Tika version number).
+
+ * tika-x.y.jar
+
+ * tika-x.y-standalone.jar
+
+ * tika-x.y-jdk14.jar
+
+ The main build artifact (tika-x.y.jar) contains the compiled Java
+ classes and interfaces in the <<<org.apache.tika>>> packages and
+ the default Tika configuration settings.
+
+ The standalone jar (tika-x.y-standalone.jar, available since version 0.2)
+ includes also the classes and resources from all Tika dependencies. You
+ can just drop this jar file in your application to access the full
+ functionality of all Tika parsers. This is a runnable jar that runs the
+ Tika command line and graphical user interfaces without needing any other
+ libraries (except of course the standard Java 5 class libraries) in the
+ classpath.
+
+ The final build artifact (tika-x.y-jdk14.jar, available since version 0.3)
+ is a {{{http://retrotranslator.sourceforge.net/}retrotranslated}} version
+ of the main Tika build artifact. Normally Tika only works with Java 5 or
+ higher, but you can use this version of Tika also with Java 1.4.
+
+Using Tika as a Maven dependency
+
+ Using Tika in a Maven project is very straightforward. Just select the
+ version of Tika you want to use, and add the following dependency.
+
+---
+<dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika</artifactId>
+ <version>x.y</version>
+</dependency>
+---
+
+ The first version of the org.apache.tika:tika artifact available in the
+ central Maven repository is 0.2. For the 0.1 version or for SNAPSHOT
+ dependencies you need to build and install Tika locally.
+
+ If your application uses Java 1.4, you need to use the retrotranslated
+ version of Tika. This version is identified by the classifier "jdk14".
+
+---
+<dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika</artifactId>
+ <version>x.y</version>
+ <classifier>jdk14</classifier>
+</dependency>
+---
+
+ The retrotranslated version will be available in the central Maven
+ repository starting with Tika version 0.3.
+
+ Note that adding the Tika dependency will introduce a number of
+ transitive dependencies to your project. You need to make sure that
+ these dependencies won't conflict with your existing project dependencies.
+ The listing below shows all the compile-scope dependencies of the
+ current Tika trunk (0.3-SNAPSHOT, November 2008). You can use the
+ command "mvn dependency:tree" to check the latest tree of dependencies.
+
+---
+org.apache.tika:tika:jar:0.3-SNAPSHOT
++- commons-lang:commons-lang:jar:2.1:compile
++- commons-logging:commons-logging:jar:1.0.4:compile
++- commons-codec:commons-codec:jar:1.3:compile
++- commons-io:commons-io:jar:1.4:compile
++- pdfbox:pdfbox:jar:0.7.3:compile
+| +- org.fontbox:fontbox:jar:0.1.0:compile
+| +- org.jempbox:jempbox:jar:0.2.0:compile
+| +- bouncycastle:bcmail-jdk14:jar:136:compile
+| \- bouncycastle:bcprov-jdk14:jar:136:compile
++- org.apache.poi:poi:jar:3.1-FINAL:compile
++- org.apache.poi:poi-scratchpad:jar:3.1-FINAL:compile
++- net.sourceforge.nekohtml:nekohtml:jar:1.9.7:compile
+| \- xerces:xercesImpl:jar:2.8.1:compile
+| \- xml-apis:xml-apis:jar:1.3.03:compile
++- com.ibm.icu:icu4j:jar:3.4.4:compile
++- asm:asm:jar:3.1:compile
+\- log4j:log4j:jar:1.2.14:compile
+---
+
+Using Tika in an Ant project
+
+ Unless you use a dependency manager tool like
+ {{{http://ant.apache.org/ivy/}Apache Ivy}}, the easiest way to include
+ Tika in your {{{http://ant.apache.org/}Ant}} build is to include the
+ standalone jar in your classpath settings. The standalone jar contains
+ everything you need, Tika and all the required dependencies, in a single
+ package.
+
+---
+<classpath>
+ ... <!-- your other classpath entries -->
+ <pathelement location="path/to/tika-x.y-standalone.jar"/>
+</classpath>
+---
+
+ If you want more control over which specific parser libraries you want
+ to include in your application, you can include main Tika jar file and
+ all the dependencies individually.
+
+---
+<classpath>
+ ... <!-- your other classpath entries -->
+ <pathelement location="path/to/tika-x.y.jar"/>
+ <pathelement location="path/to/commons-lang-2.1.jar"/>
+ <pathelement location="path/to/commons-logging-1.0.4.jar"/>
+ <pathelement location="path/to/commons-codec-1.3.jar"/>
+ <pathelement location="path/to/commons-io-1.4.jar"/>
+ <pathelement location="path/to/pdfbox-0.7.3.jar"/>
+ <pathelement location="path/to/fontbox-0.1.0.jar"/>
+ <pathelement location="path/to/jempbox-0.2.0.jar"/>
+ <pathelement location="path/to/bcmail-jdk14-136.jar"/>
+ <pathelement location="path/to/bcprov-jdk14-136.jar"/>
+ <pathelement location="path/to/poi-3.1-FINAL.jar"/>
+ <pathelement location="path/to/poi-scratchpad-3.1-FINAL.jar"/>
+ <pathelement location="path/to/nekohtml-1.9.7.jar"/>
+ <pathelement location="path/to/xercesImpl-2.8.1.jar"/>
+ <pathelement location="path/to/xml-apis-1.3.03.jar"/>
+ <pathelement location="path/to/icu4j-3.4.4.jar"/>
+ <pathelement location="path/to/asm-3.1.jar"/>
+ <pathelement location="path/to/log4j-1.2.14.jar"/>
+</classpath>
+---
+
+ If you're using Java 1.4 as the base platform of your project,
+ use the tika-x.y-jdk14.jar instead.
+
+ An easy way to gather all these libraries is to run
+ "mvn dependency:copy-dependencies" in the Tika source directory.
+ This will copy all Tika dependencies to the <<<target/dependencies>>>
+ directory.
+
+Using Tika as a command line utility
+
+ The standalone jar (tika-x.y-standalone.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. The usage instructions are shown below.
+
+---
+usage: java -jar tika-x.y-standalone.jar [option] file
+
+Options:
+ -? or --help Print this usage message
+ -v or --verbose Print debug level messages
+ -g or --gui Start the Apache Tika GUI
+ -x or --xml Output XHTML content (default)
+ -h or --html Output HTML content
+ -t or --text Output plain text content
+ -m or --metadata Output only metadata
+
+Description:
+ Apache Tika will parse the file(s) specified on the
+ command line and output the extracted text content
+ or metadata to standard output.
+
+ Instead of a file name you can also specify the URL
+ of a document to be parsed.
+
+ Use "-" as the file name to parse the standard
+ input stream.
+
+ Use the "--gui" (or "-g") option to start
+ the Apache Tika GUI. You can drag and drop files
+ from a normal file explorer to the GUI window to
+ extract text content and metadata from the files.
+---
+
+ The standalone jar is fully self-contained and should work wherever
+ a Java 5 (or higher) runtime environment is available.
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+ | java -jar tika-x.y-standalone.jar --text \
+ | grep -q keyword
+---
Modified: lucene/tika/trunk/src/site/apt/index.apt
URL: http://svn.apache.org/viewvc/lucene/tika/trunk/src/site/apt/index.apt?rev=721064&r1=721063&r2=721064&view=diff
==============================================================================
--- lucene/tika/trunk/src/site/apt/index.apt (original)
+++ lucene/tika/trunk/src/site/apt/index.apt Wed Nov 26 17:33:41 2008
@@ -24,9 +24,13 @@
libraries. For more information about Tika, please see the
{{{formats.html}list of supported document formats}} and the
{{{documentation.html}available documentation}}. You can find the
- latest release on the {{{download.html}download page}}.
+ latest release on the {{{download.html}download page}}. See the
+ {{{gettingstarted.html}Getting Started}} guide for instructions on
+ how to start using Tika.
Tika is a subproject of {{{http://lucene.apache.org/}Apache Lucene}}.
+ Lucene is a project of the
+ {{{http://www.apache.org/}Apache Software Foundation}}.
Latest News
Modified: lucene/tika/trunk/src/site/site.xml
URL: http://svn.apache.org/viewvc/lucene/tika/trunk/src/site/site.xml?rev=721064&r1=721063&r2=721064&view=diff
==============================================================================
--- lucene/tika/trunk/src/site/site.xml (original)
+++ lucene/tika/trunk/src/site/site.xml Wed Nov 26 17:33:41 2008
@@ -43,6 +43,7 @@
<item name="Download" href="download.html"/>
<item name="Documentation" href="documentation.html"/>
<item name="Supported Formats" href="formats.html"/>
+ <item name="Getting Started" href="gettingstarted.html"/>
</menu>
<menu ref="reports"/>
</body>