Posted to commits@tika.apache.org by ju...@apache.org on 2011/10/11 00:12:20 UTC
svn commit: r1181271 [2/3] - in /tika/site/src/site/apt: ./ 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/
Modified: tika/site/src/site/apt/0.6/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.6/formats.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.6/formats.apt (original)
+++ tika/site/src/site/apt/0.6/formats.apt Mon Oct 10 22:12:19 2011
@@ -1,145 +1,145 @@
- --------------------------
- Supported Document Formats
- --------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Supported Document Formats
-
- This page lists all the document formats supported by Apache Tika 0.6.
- Follow the links to the various parser class javadocs for more detailed
- information about each document format and how it is parsed by Tika.
-
-%{toc|section=1|fromDepth=1}
-
-* {HyperText Markup Language}
-
- The HyperText Markup Language (HTML) is the lingua franca of the web.
- Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
- library to support virtually any kind of HTML found on the web.
- The output from the
- {{{./api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
- is guaranteed to be well-formed and valid XHTML, and various heuristics
- are used to prevent things like inline scripts from cluttering the
- extracted text content.
-
-* {XML and derived formats}
-
- The Extensible Markup Language (XML) format is a generic format that can
- be used for all kinds of content. Tika has custom parsers for some widely
- used XML vocabularies like XHTML, OOXML and ODF, but the default
- {{{./api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
- class simply extracts the text content of the document and ignores any XML
- structure. The only exceptions to this rule are Dublin Core metadata
- elements that are used for the document metadata.
-
-* {Microsoft Office document formats}
-
- Microsoft Office and some related applications produce documents in the
- generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
- older OLE 2 format was introduced in Microsoft Office version 97 and was
- the default format until Office version 2007 and the new XML-based
- OOXML format. The
- {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
- and
- {{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
- classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
- text and metadata extraction from both OLE2 and OOXML documents.
-
-* {OpenDocument Format}
-
- The OpenDocument format (ODF) is used most notably as the default format
- of the OpenOffice.org office suite. The
- {{{./api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
- class supports this format and the earlier OpenOffice 1.0 format on which
- ODF is based.
-
-* {Portable Document Format}
-
- The {{{./api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
- parses Portable Document Format (PDF) documents using the
- {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
-
-* {Electronic Publication Format}
-
- The {{{./api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
- supports the Electronic Publication Format (EPUB) used for many digital
- books.
-
-* {Rich Text Format}
-
- The {{{./api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
- uses the standard javax.swing.text.rtf feature to extract text content
- from Rich Text Format (RTF) documents.
-
-* {Compression and packaging formats}
-
- Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
- library to support various compression and packaging formats. The
- {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
- class and its subclasses first parse the top level compression or
- packaging format and then pass the unpacked document streams to a
- second parsing stage using the parser instance specified in the
- parse context.
-
-* {Text formats}
-
- Extracting text content from plain text files seems like a simple task
- until you start thinking of all the possible character encodings. The
- {{{./api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
- encoding detection code from the {{{http://site.icu-project.org/}ICU}}
- project to automatically detect the character encoding of a text document.
-
-* {Audio formats}
-
- Tika can detect several common audio formats and extract metadata
- from them. Even text extraction is supported for some audio files that
- contain lyrics or other textual content. The
- {{{./api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
- and {{{./api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
- classes use standard javax.sound features to process simple audio
- formats, and the
- {{{./api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
- adds support for the widely used MP3 format.
-
-* {Image formats}
-
- The {{{./api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
- class uses the standard javax.imageio feature to extract simple metadata
- from image formats supported by the Java platform. More complex image
- metadata is available through the
- {{{./api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class
- that uses the metadata-extractor library to support Exif metadata
- extraction from JPEG images.
-
-* {Video formats}
-
- Currently Tika only supports the Flash video format using a simple
- parsing algorithm implemented in the
- {{{./api/org/apache/tika/parser/flv/FLVParser.html}FLVParser}} class.
-
-* {Java class files and archives}
-
- The {{{./api/org/apache/tika/parser/asm/ClassParser.html}ClassParser}} class
- extracts class names and method signatures from Java class files, and
- the {{{./api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
- also supports jar archives.
-
-* {The mbox format}
-
- The {{{./api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
- extract email messages from the mbox format used by many email archives
- and Unix-style mailboxes.
+ --------------------------
+ Supported Document Formats
+ --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+ This page lists all the document formats supported by Apache Tika 0.6.
+ Follow the links to the various parser class javadocs for more detailed
+ information about each document format and how it is parsed by Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {HyperText Markup Language}
+
+ The HyperText Markup Language (HTML) is the lingua franca of the web.
+ Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
+ library to support virtually any kind of HTML found on the web.
+ The output from the
+ {{{./api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
+ is guaranteed to be well-formed and valid XHTML, and various heuristics
+ are used to prevent things like inline scripts from cluttering the
+ extracted text content.
+
+* {XML and derived formats}
+
+ The Extensible Markup Language (XML) format is a generic format that can
+ be used for all kinds of content. Tika has custom parsers for some widely
+ used XML vocabularies like XHTML, OOXML and ODF, but the default
+ {{{./api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
+ class simply extracts the text content of the document and ignores any XML
+ structure. The only exceptions to this rule are Dublin Core metadata
+ elements that are used for the document metadata.
+
+* {Microsoft Office document formats}
+
+ Microsoft Office and some related applications produce documents in the
+ generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
+ older OLE 2 format was introduced in Microsoft Office version 97 and was
+ the default format until Office version 2007 and the new XML-based
+ OOXML format. The
+ {{{./api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+ and
+ {{{./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
+ classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
+ text and metadata extraction from both OLE2 and OOXML documents.
+
+* {OpenDocument Format}
+
+ The OpenDocument format (ODF) is used most notably as the default format
+ of the OpenOffice.org office suite. The
+ {{{./api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
+ class supports this format and the earlier OpenOffice 1.0 format on which
+ ODF is based.
+
+* {Portable Document Format}
+
+ The {{{./api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
+ parses Portable Document Format (PDF) documents using the
+ {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
+
+* {Electronic Publication Format}
+
+ The {{{./api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
+ supports the Electronic Publication Format (EPUB) used for many digital
+ books.
+
+* {Rich Text Format}
+
+ The {{{./api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
+ uses the standard javax.swing.text.rtf feature to extract text content
+ from Rich Text Format (RTF) documents.
+
+* {Compression and packaging formats}
+
+ Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
+ library to support various compression and packaging formats. The
+ {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
+ class and its subclasses first parse the top level compression or
+ packaging format and then pass the unpacked document streams to a
+ second parsing stage using the parser instance specified in the
+ parse context.
+
+* {Text formats}
+
+ Extracting text content from plain text files seems like a simple task
+ until you start thinking of all the possible character encodings. The
+ {{{./api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
+ encoding detection code from the {{{http://site.icu-project.org/}ICU}}
+ project to automatically detect the character encoding of a text document.
+
+* {Audio formats}
+
+ Tika can detect several common audio formats and extract metadata
+ from them. Even text extraction is supported for some audio files that
+ contain lyrics or other textual content. The
+ {{{./api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
+ and {{{./api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
+ classes use standard javax.sound features to process simple audio
+ formats, and the
+ {{{./api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
+ adds support for the widely used MP3 format.
+
+* {Image formats}
+
+ The {{{./api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
+ class uses the standard javax.imageio feature to extract simple metadata
+ from image formats supported by the Java platform. More complex image
+ metadata is available through the
+ {{{./api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class
+ that uses the metadata-extractor library to support Exif metadata
+ extraction from JPEG images.
+
+* {Video formats}
+
+ Currently Tika only supports the Flash video format using a simple
+ parsing algorithm implemented in the
+ {{{./api/org/apache/tika/parser/flv/FLVParser.html}FLVParser}} class.
+
+* {Java class files and archives}
+
+ The {{{./api/org/apache/tika/parser/asm/ClassParser.html}ClassParser}} class
+ extracts class names and method signatures from Java class files, and
+ the {{{./api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
+ also supports jar archives.
+
+* {The mbox format}
+
+ The {{{./api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
+ extract email messages from the mbox format used by many email archives
+ and Unix-style mailboxes.
Propchange: tika/site/src/site/apt/0.6/formats.apt
------------------------------------------------------------------------------
svn:eol-style = native
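
Note on the "Text formats" entry in formats.apt above: the reason encoding
detection matters can be seen with the JDK alone. The sketch below is not
Tika's ICU-based detection. It is a hypothetical class (EncodingAmbiguity is
an invented name) that only shows how the same bytes yield different text
under two charsets, which is the ambiguity a detector has to resolve.

```java
import java.nio.charset.Charset;

public class EncodingAmbiguity {

    // Decode the same raw bytes under a caller-chosen charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");

        // U+00E9 (e with acute accent) is the two bytes 0xC3 0xA9 in UTF-8.
        byte[] raw = "\u00e9".getBytes(utf8);

        // Decoded with the right charset the bytes form one character.
        System.out.println(decode(raw, utf8).length());   // prints 1

        // Decoded as ISO-8859-1 each byte becomes its own character.
        System.out.println(decode(raw, latin1).length()); // prints 2
    }
}
```

Nothing in the byte sequence itself says which reading is intended, so a
parser for plain text has to guess from the data when no encoding is declared.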
Modified: tika/site/src/site/apt/0.6/gettingstarted.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.6/gettingstarted.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.6/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.6/gettingstarted.apt Mon Oct 10 22:12:19 2011
@@ -1,207 +1,207 @@
- --------------------------------
- Getting Started with Apache Tika
- --------------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Getting Started with Apache Tika
-
- This document describes how to build Apache Tika from sources and
- how to start using Tika in an application.
-
-Getting and building the sources
-
- To build Tika from sources you first need to either
- {{{../download.html}download}} a source release or
- {{{../source-repository.html}checkout}} the latest sources from
- version control.
-
- Once you have the sources, you can build them using the
- {{{http://maven.apache.org/}Maven 2}} build system. Executing the
- following command in the base directory will build the sources
- and install the resulting artifacts in your local Maven repository.
-
----
-mvn install
----
-
- See the Maven documentation for more information about the available
- build options.
-
- Note that you need Java 5 or higher to build Tika.
-
-Build artifacts
-
- The Tika 0.6 build consists of a number of components and produces
- the following main binaries:
-
- [tika-core/target/tika-core-0.6.jar]
- Tika core library. Contains the core interfaces and classes of Tika,
- but none of the parser implementations. Depends only on Java 5.
-
- [tika-parsers/target/tika-parsers-0.6.jar]
- Tika parsers. Collection of classes that implement the Tika Parser
- interface based on various external parser libraries.
-
- [tika-app/target/tika-app-0.6.jar]
- Tika application. Combines the above libraries and all the external
- parser libraries into a single runnable jar with a GUI and a command
- line interface.
-
- [tika-bundle/target/tika-bundle-0.6.jar]
- Tika bundle. An OSGi bundle that includes everything you need to use all
- Tika functionality in an OSGi environment.
-
-Using Tika as a Maven dependency
-
- The core library, tika-core, contains the key interfaces and classes of Tika
- and can be used by itself if you don't need the full set of parsers from
- the tika-parsers component. The tika-core dependency looks like this:
-
----
- <dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-core</artifactId>
- <version>0.6</version>
- </dependency>
----
-
- If you want to use Tika to parse documents (instead of simply detecting
- document types, etc.), you'll want to depend on tika-parsers instead:
-
----
- <dependency>
- <groupId>org.apache.tika</groupId>
- <artifactId>tika-parsers</artifactId>
- <version>0.6</version>
- </dependency>
----
-
- Note that adding this dependency will introduce a number of
- transitive dependencies to your project, including one on tika-core.
- You need to make sure that these dependencies won't conflict with your
- existing project dependencies. The listing below shows all the
- compile-scope dependencies of tika-parsers in the Tika 0.6 release.
-
----
-org.apache.tika:tika-parsers:bundle:0.6
-+- org.apache.tika:tika-core:jar:0.6:compile
-+- org.apache.commons:commons-compress:jar:1.0:compile
-+- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
-| +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
-| \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
-+- org.apache.poi:poi:jar:3.6:compile
-+- org.apache.poi:poi-scratchpad:jar:3.6:compile
-+- org.apache.poi:poi-ooxml:jar:3.6:compile
-| +- org.apache.poi:poi-ooxml-schemas:jar:3.6:compile
-| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
-| \- dom4j:dom4j:jar:1.6.1:compile
-| \- xml-apis:xml-apis:jar:1.0.b2:compile
-+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
-+- commons-logging:commons-logging:jar:1.1.1:compile
-+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
-+- asm:asm:jar:3.1:compile
-+- log4j:log4j:jar:1.2.14:compile
-\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
----
-
-Using Tika in an Ant project
-
- Unless you use a dependency manager tool like
- {{{http://ant.apache.org/ivy/}Apache Ivy}}, you can use Tika in your application
- by including the Tika jar files and their dependencies individually.
-
----
-<classpath>
- ... <!-- your other classpath entries -->
- <pathelement location="path/to/tika-core-0.6.jar"/>
- <pathelement location="path/to/tika-parsers-0.6.jar"/>
- <pathelement location="path/to/commons-logging-1.1.1.jar"/>
- <pathelement location="path/to/commons-compress-1.0.jar"/>
- <pathelement location="path/to/pdfbox-0.8.0-incubating.jar"/>
- <pathelement location="path/to/fontbox-0.8.0-incubator.jar"/>
- <pathelement location="path/to/jempbox-0.8.0-incubator.jar"/>
- <pathelement location="path/to/poi-3.6.jar"/>
- <pathelement location="path/to/poi-scratchpad-3.6.jar"/>
- <pathelement location="path/to/poi-ooxml-3.6.jar"/>
- <pathelement location="path/to/poi-ooxml-schemas-3.6.jar"/>
- <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
- <pathelement location="path/to/dom4j-1.6.1.jar"/>
- <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
- <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.1.jar"/>
- <pathelement location="path/to/tagsoup-1.2.jar"/>
- <pathelement location="path/to/asm-3.1.jar"/>
- <pathelement location="path/to/log4j-1.2.14.jar"/>
- <pathelement location="path/to/metadata-extractor-2.4.0-beta-1.jar"/>
-</classpath>
----
-
- An easy way to gather all these libraries is to run
- "mvn dependency:copy-dependencies" in the tika-parsers source directory.
- This will copy all Tika dependencies to the <<<target/dependency>>>
- directory.
-
- Alternatively you can simply drop the entire tika-app jar to your
- classpath to get all of the above dependencies in a single archive.
-
-Using Tika as a command line utility
-
- The Tika application jar (tika-app-0.6.jar) can be used as a command
- line utility for extracting text content and metadata from all sorts of
- files. This runnable jar contains all the dependencies it needs, so
- you don't need to worry about classpath settings to run it.
-
- The usage instructions are shown below.
-
----
-usage: java -jar tika-app-0.6.jar [option] [file]
-
-Options:
- -? or --help Print this usage message
- -v or --verbose Print debug level messages
- -g or --gui Start the Apache Tika GUI
- -x or --xml Output XHTML content (default)
- -h or --html Output HTML content
- -t or --text Output plain text content
- -m or --metadata Output only metadata
-
-Description:
- Apache Tika will parse the file(s) specified on the
- command line and output the extracted text content
- or metadata to standard output.
-
- Instead of a file name you can also specify the URL
- of a document to be parsed.
-
- If no file name or URL is specified (or the special
- name "-" is used), then the standard input stream
- is parsed.
-
- Use the "--gui" (or "-g") option to start
- the Apache Tika GUI. You can drag and drop files
- from a normal file explorer to the GUI window to
- extract text content and metadata from the files.
----
-
- You can also use the jar as a component in a Unix pipeline or
- as an external tool in many scripting languages.
-
----
-# Check if an Internet resource contains a specific keyword
-curl http://.../document.doc \
- | java -jar tika-app-0.6.jar --text \
- | grep -q keyword
----
+ --------------------------------
+ Getting Started with Apache Tika
+ --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../source-repository.html}checkout}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 5 or higher to build Tika.
+
+Build artifacts
+
+ The Tika 0.6 build consists of a number of components and produces
+ the following main binaries:
+
+ [tika-core/target/tika-core-0.6.jar]
+ Tika core library. Contains the core interfaces and classes of Tika,
+ but none of the parser implementations. Depends only on Java 5.
+
+ [tika-parsers/target/tika-parsers-0.6.jar]
+ Tika parsers. Collection of classes that implement the Tika Parser
+ interface based on various external parser libraries.
+
+ [tika-app/target/tika-app-0.6.jar]
+ Tika application. Combines the above libraries and all the external
+ parser libraries into a single runnable jar with a GUI and a command
+ line interface.
+
+ [tika-bundle/target/tika-bundle-0.6.jar]
+ Tika bundle. An OSGi bundle that includes everything you need to use all
+ Tika functionality in an OSGi environment.
+
+Using Tika as a Maven dependency
+
+ The core library, tika-core, contains the key interfaces and classes of Tika
+ and can be used by itself if you don't need the full set of parsers from
+ the tika-parsers component. The tika-core dependency looks like this:
+
+---
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-core</artifactId>
+ <version>0.6</version>
+ </dependency>
+---
+
+ If you want to use Tika to parse documents (instead of simply detecting
+ document types, etc.), you'll want to depend on tika-parsers instead:
+
+---
+ <dependency>
+ <groupId>org.apache.tika</groupId>
+ <artifactId>tika-parsers</artifactId>
+ <version>0.6</version>
+ </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project, including one on tika-core.
+ You need to make sure that these dependencies won't conflict with your
+ existing project dependencies. The listing below shows all the
+ compile-scope dependencies of tika-parsers in the Tika 0.6 release.
+
+---
+org.apache.tika:tika-parsers:bundle:0.6
++- org.apache.tika:tika-core:jar:0.6:compile
++- org.apache.commons:commons-compress:jar:1.0:compile
++- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
+| +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
+| \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
++- org.apache.poi:poi:jar:3.6:compile
++- org.apache.poi:poi-scratchpad:jar:3.6:compile
++- org.apache.poi:poi-ooxml:jar:3.6:compile
+| +- org.apache.poi:poi-ooxml-schemas:jar:3.6:compile
+| | \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
+| \- dom4j:dom4j:jar:1.6.1:compile
+| \- xml-apis:xml-apis:jar:1.0.b2:compile
++- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
++- commons-logging:commons-logging:jar:1.1.1:compile
++- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
++- asm:asm:jar:3.1:compile
++- log4j:log4j:jar:1.2.14:compile
+\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+---
+
+Using Tika in an Ant project
+
+ Unless you use a dependency manager tool like
+ {{{http://ant.apache.org/ivy/}Apache Ivy}}, you can use Tika in your application
+ by including the Tika jar files and their dependencies individually.
+
+---
+<classpath>
+ ... <!-- your other classpath entries -->
+ <pathelement location="path/to/tika-core-0.6.jar"/>
+ <pathelement location="path/to/tika-parsers-0.6.jar"/>
+ <pathelement location="path/to/commons-logging-1.1.1.jar"/>
+ <pathelement location="path/to/commons-compress-1.0.jar"/>
+ <pathelement location="path/to/pdfbox-0.8.0-incubating.jar"/>
+ <pathelement location="path/to/fontbox-0.8.0-incubator.jar"/>
+ <pathelement location="path/to/jempbox-0.8.0-incubator.jar"/>
+ <pathelement location="path/to/poi-3.6.jar"/>
+ <pathelement location="path/to/poi-scratchpad-3.6.jar"/>
+ <pathelement location="path/to/poi-ooxml-3.6.jar"/>
+ <pathelement location="path/to/poi-ooxml-schemas-3.6.jar"/>
+ <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
+ <pathelement location="path/to/dom4j-1.6.1.jar"/>
+ <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
+ <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.1.jar"/>
+ <pathelement location="path/to/tagsoup-1.2.jar"/>
+ <pathelement location="path/to/asm-3.1.jar"/>
+ <pathelement location="path/to/log4j-1.2.14.jar"/>
+ <pathelement location="path/to/metadata-extractor-2.4.0-beta-1.jar"/>
+</classpath>
+---
+
+ An easy way to gather all these libraries is to run
+ "mvn dependency:copy-dependencies" in the tika-parsers source directory.
+ This will copy all Tika dependencies to the <<<target/dependency>>>
+ directory.
+
+ Alternatively you can simply drop the entire tika-app jar to your
+ classpath to get all of the above dependencies in a single archive.
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-0.6.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app-0.6.jar [option] [file]
+
+Options:
+ -? or --help Print this usage message
+ -v or --verbose Print debug level messages
+ -g or --gui Start the Apache Tika GUI
+ -x or --xml Output XHTML content (default)
+ -h or --html Output HTML content
+ -t or --text Output plain text content
+ -m or --metadata Output only metadata
+
+Description:
+ Apache Tika will parse the file(s) specified on the
+ command line and output the extracted text content
+ or metadata to standard output.
+
+ Instead of a file name you can also specify the URL
+ of a document to be parsed.
+
+ If no file name or URL is specified (or the special
+ name "-" is used), then the standard input stream
+ is parsed.
+
+ Use the "--gui" (or "-g") option to start
+ the Apache Tika GUI. You can drag and drop files
+ from a normal file explorer to the GUI window to
+ extract text content and metadata from the files.
+---
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+ | java -jar tika-app-0.6.jar --text \
+ | grep -q keyword
+---
Propchange: tika/site/src/site/apt/0.6/gettingstarted.apt
------------------------------------------------------------------------------
svn:eol-style = native
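
Note on the command line section of gettingstarted.apt above: the same
invocation can be driven from Java as an external process. The following is a
minimal sketch under stated assumptions: TikaCommand and build are
hypothetical names, the jar is assumed to sit in the working directory as
tika-app-0.6.jar, and only the --text option from the usage listing is used.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class TikaCommand {

    // Build the argument list for: java -jar <jarPath> <option> <file>
    static List<String> build(String jarPath, String option, String file) {
        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-jar");
        cmd.add(jarPath);
        cmd.add(option);
        cmd.add(file);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TikaCommand <file-or-URL>");
            return;
        }
        // Assumption: tika-app-0.6.jar is in the current working directory.
        ProcessBuilder pb =
                new ProcessBuilder(build("tika-app-0.6.jar", "--text", args[0]));
        // Merge stderr into stdout so a single reader drains both streams.
        pb.redirectErrorStream(true);
        Process process = pb.start();

        // Reads with the platform default encoding, matching the default
        // text output encoding of tika-app on most platforms.
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(process.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            System.out.println(line);
        }
        reader.close();
        System.err.println("tika-app exit code: " + process.waitFor());
    }
}
```

Passing the file name explicitly avoids the case where tika-app waits on
standard input, which is what it parses when no file or URL is given.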
Modified: tika/site/src/site/apt/0.6/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.6/index.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.6/index.apt (original)
+++ tika/site/src/site/apt/0.6/index.apt Mon Oct 10 22:12:19 2011
@@ -1,112 +1,112 @@
- ---------------
- Apache Tika 0.6
- ---------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Apache Tika 0.6
-
- The most notable changes in Tika 0.6 over the previous release are:
-
- * Mime-type detection for HTML (and all types) has been improved.
- Malformed HTML files, and those HTML files that require
- a bit more observed content before the type is properly detected,
- are now correctly identified by the AutoDetectParser.
- ({{{https://issues.apache.org/jira/browse/TIKA-327}TIKA-327}},
- {{{https://issues.apache.org/jira/browse/TIKA-357}TIKA-357}},
- {{{https://issues.apache.org/jira/browse/TIKA-366}TIKA-366}},
- {{{https://issues.apache.org/jira/browse/TIKA-367}TIKA-367}})
-
- * Tika now has an additional OSGi bundle packaging that includes all
- the required parser libraries. This bundle package makes it easy to
- use all Tika features in an OSGi environment.
- ({{{https://issues.apache.org/jira/browse/TIKA-340}TIKA-340}},
- {{{https://issues.apache.org/jira/browse/TIKA-342}TIKA-342}})
-
- * The Apache POI dependency used for parsing Microsoft Office file
- formats has been upgraded to version 3.6. The most visible
- improvement in this version is the notably reduced ooxml jar file
- size. The tika-app jar size is now down to 15MB from the 25MB in
- Tika 0.5.
- ({{{https://issues.apache.org/jira/browse/TIKA-353}TIKA-353}})
-
- * Handling of character encoding information in input metadata and
- HTML \<meta\> tags has been improved. When no applicable encoding
- information is available, the encoding is detected by looking at
- the input data.
- ({{{https://issues.apache.org/jira/browse/TIKA-332}TIKA-332}},
- {{{https://issues.apache.org/jira/browse/TIKA-334}TIKA-334}},
- {{{https://issues.apache.org/jira/browse/TIKA-335}TIKA-335}},
- {{{https://issues.apache.org/jira/browse/TIKA-341}TIKA-341}})
-
- * Some document types like Excel spreadsheets contain content like
- numbers or formulas whose exact text format depends on the current
- locale. So far Tika has used the platform default locale in such
- cases, but clients can now explicitly specify the locale by passing
- a Locale instance in the parse context.
- ({{{https://issues.apache.org/jira/browse/TIKA-125}TIKA-125}})
-
- * The default text output encoding of the tika-app jar is now UTF-8
- when running on Mac OS X. This is because the default encoding used
- by Java is not compatible with the console application in Mac OS X.
- On all other platforms the text output from tika-app still uses
- the platform default encoding.
- ({{{https://issues.apache.org/jira/browse/TIKA-324}TIKA-324}})
-
- * A flash video (video/x-flv) parser has been added.
- ({{{https://issues.apache.org/jira/browse/TIKA-328}TIKA-328}})
-
- * Handling of Number and Date cell formatting within Microsoft
- Excel documents has been added. This includes currencies,
- percentages and scientific formats.
- ({{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}})
-
- The following people have contributed to Tika 0.6 by submitting or
- commenting on the issues resolved in this release:
-
- * Andrzej Bialecki
-
- * Bertrand Delacretaz
-
- * Chris A. Mattmann
-
- * Dave Meikle
-
- * Erik Hetzner
-
- * Felix Meschberger
-
- * Jukka Zitting
-
- * Julien Nioche
-
- * Ken Krugler
-
- * Luke Nezda
-
- * Maxim Valyanskiy
-
- * Niall Pemberton
-
- * Peter Wolanin
-
- * Piotr B.
-
- * Sami Siren
-
- * Yuan-Fang Li
-
- See {{http://tinyurl.com/yc3dk67}} for more details on these contributions.
+ ---------------
+ Apache Tika 0.6
+ ---------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika 0.6
+
+ The most notable changes in Tika 0.6 over the previous release are:
+
+ * Mime-type detection for HTML (and all types) has been improved,
+ so that malformed HTML files, and HTML files that require a bit
+ more observed content before the type is properly detected, are
+ now correctly identified by the AutoDetectParser.
+ ({{{https://issues.apache.org/jira/browse/TIKA-327}TIKA-327}},
+ {{{https://issues.apache.org/jira/browse/TIKA-357}TIKA-357}},
+ {{{https://issues.apache.org/jira/browse/TIKA-366}TIKA-366}},
+ {{{https://issues.apache.org/jira/browse/TIKA-367}TIKA-367}})
+
+ * Tika now has an additional OSGi bundle packaging that includes all
+ the required parser libraries. This bundle package makes it easy to
+ use all Tika features in an OSGi environment.
+ ({{{https://issues.apache.org/jira/browse/TIKA-340}TIKA-340}},
+ {{{https://issues.apache.org/jira/browse/TIKA-342}TIKA-342}})
+
+ * The Apache POI dependency used for parsing Microsoft Office file
+ formats has been upgraded to version 3.6. The most visible
+ improvement in this version is the notably reduced ooxml jar file
+ size. The tika-app jar size is now down to 15MB from the 25MB in
+ Tika 0.5.
+ ({{{https://issues.apache.org/jira/browse/TIKA-353}TIKA-353}})
+
+ * Handling of character encoding information in input metadata and
+ HTML \<meta\> tags has been improved. When no applicable encoding
+ information is available, the encoding is detected by looking at
+ the input data.
+ ({{{https://issues.apache.org/jira/browse/TIKA-332}TIKA-332}},
+ {{{https://issues.apache.org/jira/browse/TIKA-334}TIKA-334}},
+ {{{https://issues.apache.org/jira/browse/TIKA-335}TIKA-335}},
+ {{{https://issues.apache.org/jira/browse/TIKA-341}TIKA-341}})
+
+ * Some document types like Excel spreadsheets contain content like
+ numbers or formulas whose exact text format depends on the current
+ locale. So far Tika has used the platform default locale in such
+ cases, but clients can now explicitly specify the locale by passing
+ a Locale instance in the parse context.
+ ({{{https://issues.apache.org/jira/browse/TIKA-125}TIKA-125}})
+
+ * The default text output encoding of the tika-app jar is now UTF-8
+ when running on Mac OS X. This is because the default encoding used
+ by Java is not compatible with the console application in Mac OS X.
+ On all other platforms the text output from tika-app still uses
+ the platform default encoding.
+ ({{{https://issues.apache.org/jira/browse/TIKA-324}TIKA-324}})
+
+ * A flash video (video/x-flv) parser has been added.
+ ({{{https://issues.apache.org/jira/browse/TIKA-328}TIKA-328}})
+
+ * Handling of Number and Date cell formatting within Microsoft
+ Excel documents has been added. This includes currencies,
+ percentages and scientific formats.
+ ({{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}})
+
+ The following people have contributed to Tika 0.6 by submitting or
+ commenting on the issues resolved in this release:
+
+ * Andrzej Bialecki
+
+ * Bertrand Delacretaz
+
+ * Chris A. Mattmann
+
+ * Dave Meikle
+
+ * Erik Hetzner
+
+ * Felix Meschberger
+
+ * Jukka Zitting
+
+ * Julien Nioche
+
+ * Ken Krugler
+
+ * Luke Nezda
+
+ * Maxim Valyanskiy
+
+ * Niall Pemberton
+
+ * Peter Wolanin
+
+ * Piotr B.
+
+ * Sami Siren
+
+ * Yuan-Fang Li
+
+ See {{http://tinyurl.com/yc3dk67}} for more details on these contributions.
Propchange: tika/site/src/site/apt/0.6/index.apt
------------------------------------------------------------------------------
svn:eol-style = native
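
The TIKA-125 item in the index.apt text above notes that the text form of
spreadsheet numbers depends on the locale. As a rough illustration using only
the JDK (this is not Tika code, and the class name LocaleFormatDemo is
hypothetical), the same value renders differently under two locales. How the
Locale is handed to Tika through the parse context is not shown in the source,
so it is left out of this sketch.

```java
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical demo class: shows why a parser needs to know the locale
// before it can turn a numeric cell into text.
public class LocaleFormatDemo {

    // Format a number with exactly two fraction digits for the given locale.
    static String format(double value, Locale locale) {
        NumberFormat nf = NumberFormat.getNumberInstance(locale);
        nf.setMinimumFractionDigits(2);
        nf.setMaximumFractionDigits(2);
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));      // 1,234.50
        System.out.println(format(1234.5, Locale.GERMANY)); // 1.234,50
    }
}
```

A client that relies on the platform default locale would get whichever of
these forms the host machine happens to use, which is the situation the
release note describes.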
Modified: tika/site/src/site/apt/0.6/parser.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.6/parser.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.6/parser.apt (original)
+++ tika/site/src/site/apt/0.6/parser.apt Mon Oct 10 22:12:19 2011
@@ -1,245 +1,245 @@
- --------------------
- The Parser interface
- --------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-The Parser interface
-
- The
- {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
- interface is the key concept of Apache Tika. It hides the complexity of
- different file formats and parsing libraries while providing a simple and
- powerful mechanism for client applications to extract structured text
- content and metadata from all sorts of documents. All this is achieved
- with a single method:
-
----
-void parse(
- InputStream stream, ContentHandler handler, Metadata metadata,
- ParseContext context) throws IOException, SAXException, TikaException;
----
-
- The <<<parse>>> method takes the document to be parsed and related metadata
- as input and outputs the results as XHTML SAX events and extra metadata.
- The parse context argument is used to specify context information (like
- the current locale) that is not related to any individual document.
- The main criteria that led to this design were:
-
- [Streamed parsing] The interface should require neither the client
- application nor the parser implementation to keep the full document
- content in memory or spooled to disk. This allows even huge documents
- to be parsed without excessive resource requirements.
-
- [Structured content] A parser implementation should be able to
- include structural information (headings, links, etc.) in the extracted
- content. A client application can use this information for example to
- better judge the relevance of different parts of the parsed document.
-
- [Input metadata] A client application should be able to include metadata
- like the file name or declared content type with the document to be
- parsed. The parser implementation can use this information to better
- guide the parsing process.
-
- [Output metadata] A parser implementation should be able to return
- document metadata in addition to document content. Many document
- formats contain metadata like the name of the author that may be useful
- to client applications.
-
- [Context sensitivity] While the default settings and behaviour of Tika
- parsers should work well for most use cases, there are still situations
- where more fine-grained control over the parsing process is desirable.
- It should be easy to inject such context-specific information to the
- parsing process without breaking the layers of abstraction.
-
- []
-
- These criteria are reflected in the arguments of the <<<parse>>> method.
-
-* Document input stream
-
- The first argument is an
- {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
- for reading the document to be parsed.
-
- If this document stream can not be read, then parsing stops and the thrown
- {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
- is passed up to the client application. If the stream can be read but
- not parsed (for example if the document is corrupted), then the parser
- throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
-
- The parser implementation will consume this stream but <will not close it>.
- Closing the stream is the responsibility of the client application that
- opened it in the first place. The recommended pattern for using streams
- with the <<<parse>>> method is:
-
----
-InputStream stream = ...; // open the stream
-try {
- parser.parse(stream, ...); // parse the stream
-} finally {
- stream.close(); // close the stream
-}
----
-
- Some document formats like the OLE2 Compound Document Format used by
- Microsoft Office are best parsed as random access files. In such cases the
- content of the input stream is automatically spooled to a temporary file
- that gets removed once parsed. A future version of Tika may make it possible
- to avoid this extra file if the input document is already a file in the
- local file system. See
- {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
- of this feature request.
-
-* XHTML SAX events
-
- The parsed content of the document stream is returned to the client
- application as a sequence of XHTML SAX events. XHTML is used to express
- structured content of the document and SAX events enable streamed
- processing. Note that the XHTML format is used here only to convey
- structural information, not to render the documents for browsing!
-
- The XHTML SAX events produced by the parser implementation are sent to a
- {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
- instance given to the <<<parse>>> method. If the content handler
- fails to process an event, then parsing stops and the thrown
- {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
- is passed up to the client application.
-
- The overall structure of the generated event stream is (with indenting
- added for clarity):
-
----
-<html xmlns="http://www.w3.org/1999/xhtml">
- <head>
- <title>...</title>
- </head>
- <body>
- ...
- </body>
-</html>
----
-
- Parser implementations typically use the
- {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
- utility class to generate the XHTML output.
-
- Dealing with the raw SAX events can be a bit complex, so Apache Tika
- comes with a number of utility classes that can be used to process and
- convert the event stream to other representations.
-
- For example, the
- {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
- class can be used to extract just the body part of the XHTML output and
- feed it either as SAX events to another content handler or as characters
- to an output stream, a writer, or simply a string. The following code
- snippet parses a document from the standard input stream and outputs the
- extracted text content to standard output:
-
----
-ContentHandler handler = new BodyContentHandler(System.out);
-parser.parse(System.in, handler, ...);
----
-
- Another useful class is
- {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
- uses a background thread to parse the document and returns the extracted
- text content as a character stream:
-
----
-InputStream stream = ...; // the document to be parsed
-Reader reader = new ParsingReader(parser, stream, ...);
-try {
- ...; // read the document text using the reader
-} finally {
- reader.close(); // the document stream is closed automatically
-}
----
-
-* Document metadata
-
- The third argument to the <<<parse>>> method is used to pass document
- metadata both in and out of the parser. Document metadata is expressed
- as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
-
- The following are some of the more interesting metadata properties:
-
- [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
- the document.
-
- A client application can set this property to allow the parser to use
- file name heuristics to determine the format of the document.
-
- The parser implementation may set this property if the file format
- contains the canonical name of the file (for example the Gzip format
- has a slot for the file name).
-
- [Metadata.CONTENT_TYPE] The declared content type of the document.
-
- A client application can set this property based on, for example, an HTTP
- Content-Type header. The declared content type may help the parser to
- correctly interpret the document.
-
- The parser implementation sets this property to the content type according
- to which the document was parsed.
-
- [Metadata.TITLE] The title of the document.
-
- The parser implementation sets this property if the document format
- contains an explicit title field.
-
- [Metadata.AUTHOR] The name of the author of the document.
-
- The parser implementation sets this property if the document format
- contains an explicit author field.
-
- []
-
- Note that metadata handling is still being discussed by the Tika development
- team, and it is likely that there will be some (backwards incompatible)
- changes in metadata handling before Tika 1.0.
-
-* Parse context
-
- The final argument to the <<<parse>>> method is used to inject
- context-specific information to the parsing process. This is useful
- for example when dealing with locale-specific date and number formats
- in Microsoft Excel spreadsheets. Another important use of the parse
- context is passing in the delegate parser instance to be used by
- two-phase parsers like the
- {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
- Some parser classes allow customization of the parsing process through
- strategy objects in the parse context.
-
-* Parser implementations
-
- Apache Tika comes with a number of parser classes for parsing
- {{{./formats.html}various document formats}}. You can also extend Tika
- with your own parsers, and of course any contributions to Tika are
- warmly welcome.
-
- The goal of Tika is to reuse existing parser libraries like
- {{{http://www.pdfbox.org/}PDFBox}} or
- {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
- of the parser classes in Tika are adapters to such external libraries.
-
- Tika also contains some general purpose parser implementations that are
- not targeted at any specific document formats. The most notable of these
- is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
- class that encapsulates all Tika functionality into a single parser that
- can handle any type of document. This parser will automatically determine
- the type of the incoming document based on various heuristics and will then
- parse the document accordingly.
+ --------------------
+ The Parser interface
+ --------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+The Parser interface
+
+ The
+ {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+ interface is the key concept of Apache Tika. It hides the complexity of
+ different file formats and parsing libraries while providing a simple and
+ powerful mechanism for client applications to extract structured text
+ content and metadata from all sorts of documents. All this is achieved
+ with a single method:
+
+---
+void parse(
+ InputStream stream, ContentHandler handler, Metadata metadata,
+ ParseContext context) throws IOException, SAXException, TikaException;
+---
+
+ The <<<parse>>> method takes the document to be parsed and related metadata
+ as input and outputs the results as XHTML SAX events and extra metadata.
+ The parse context argument is used to specify context information (like
+ the current locale) that is not related to any individual document.
+ The main criteria that led to this design were:
+
+ [Streamed parsing] The interface should require neither the client
+ application nor the parser implementation to keep the full document
+ content in memory or spooled to disk. This allows even huge documents
+ to be parsed without excessive resource requirements.
+
+ [Structured content] A parser implementation should be able to
+ include structural information (headings, links, etc.) in the extracted
+ content. A client application can use this information for example to
+ better judge the relevance of different parts of the parsed document.
+
+ [Input metadata] A client application should be able to include metadata
+ like the file name or declared content type with the document to be
+ parsed. The parser implementation can use this information to better
+ guide the parsing process.
+
+ [Output metadata] A parser implementation should be able to return
+ document metadata in addition to document content. Many document
+ formats contain metadata like the name of the author that may be useful
+ to client applications.
+
+ [Context sensitivity] While the default settings and behaviour of Tika
+ parsers should work well for most use cases, there are still situations
+ where more fine-grained control over the parsing process is desirable.
+ It should be easy to inject such context-specific information to the
+ parsing process without breaking the layers of abstraction.
+
+ []
+
+ These criteria are reflected in the arguments of the <<<parse>>> method.
+
+* Document input stream
+
+ The first argument is an
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
+ for reading the document to be parsed.
+
+ If this document stream can not be read, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+ is passed up to the client application. If the stream can be read but
+ not parsed (for example if the document is corrupted), then the parser
+ throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+ The parser implementation will consume this stream but <will not close it>.
+ Closing the stream is the responsibility of the client application that
+ opened it in the first place. The recommended pattern for using streams
+ with the <<<parse>>> method is:
+
+---
+InputStream stream = ...; // open the stream
+try {
+ parser.parse(stream, ...); // parse the stream
+} finally {
+ stream.close(); // close the stream
+}
+---
+
+ Some document formats like the OLE2 Compound Document Format used by
+ Microsoft Office are best parsed as random access files. In such cases the
+ content of the input stream is automatically spooled to a temporary file
+ that gets removed once parsed. A future version of Tika may make it possible
+ to avoid this extra file if the input document is already a file in the
+ local file system. See
+ {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+ of this feature request.
+
+* XHTML SAX events
+
+ The parsed content of the document stream is returned to the client
+ application as a sequence of XHTML SAX events. XHTML is used to express
+ structured content of the document and SAX events enable streamed
+ processing. Note that the XHTML format is used here only to convey
+ structural information, not to render the documents for browsing!
+
+ The XHTML SAX events produced by the parser implementation are sent to a
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+ instance given to the <<<parse>>> method. If the content handler
+ fails to process an event, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+ is passed up to the client application.
+
+ The overall structure of the generated event stream is (with indenting
+ added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>...</title>
+ </head>
+ <body>
+ ...
+ </body>
+</html>
+---
+
+ Parser implementations typically use the
+ {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+ utility class to generate the XHTML output.
+
+ Dealing with the raw SAX events can be a bit complex, so Apache Tika
+ comes with a number of utility classes that can be used to process and
+ convert the event stream to other representations.
+
+ For example, the
+ {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+ class can be used to extract just the body part of the XHTML output and
+ feed it either as SAX events to another content handler or as characters
+ to an output stream, a writer, or simply a string. The following code
+ snippet parses a document from the standard input stream and outputs the
+ extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+ Another useful class is
+ {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
+ uses a background thread to parse the document and returns the extracted
+ text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+ ...; // read the document text using the reader
+} finally {
+ reader.close(); // the document stream is closed automatically
+}
+---
+
+* Document metadata
+
+ The third argument to the <<<parse>>> method is used to pass document
+ metadata both in and out of the parser. Document metadata is expressed
+ as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+ The following are some of the more interesting metadata properties:
+
+ [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+ the document.
+
+ A client application can set this property to allow the parser to use
+ file name heuristics to determine the format of the document.
+
+ The parser implementation may set this property if the file format
+ contains the canonical name of the file (for example the Gzip format
+ has a slot for the file name).
+
+ [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+ A client application can set this property based on, for example, an HTTP
+ Content-Type header. The declared content type may help the parser to
+ correctly interpret the document.
+
+ The parser implementation sets this property to the content type according
+ to which the document was parsed.
+
+ [Metadata.TITLE] The title of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit title field.
+
+ [Metadata.AUTHOR] The name of the author of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit author field.
+
+ []
+
+ Note that metadata handling is still being discussed by the Tika development
+ team, and it is likely that there will be some (backwards incompatible)
+ changes in metadata handling before Tika 1.0.
+
+* Parse context
+
+ The final argument to the <<<parse>>> method is used to inject
+ context-specific information to the parsing process. This is useful
+ for example when dealing with locale-specific date and number formats
+ in Microsoft Excel spreadsheets. Another important use of the parse
+ context is passing in the delegate parser instance to be used by
+ two-phase parsers like the
+ {{{./api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+ Some parser classes allow customization of the parsing process through
+ strategy objects in the parse context.
+
+* Parser implementations
+
+ Apache Tika comes with a number of parser classes for parsing
+ {{{./formats.html}various document formats}}. You can also extend Tika
+ with your own parsers, and of course any contributions to Tika are
+ warmly welcome.
+
+ The goal of Tika is to reuse existing parser libraries like
+ {{{http://www.pdfbox.org/}PDFBox}} or
+ {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+ of the parser classes in Tika are adapters to such external libraries.
+
+ Tika also contains some general purpose parser implementations that are
+ not targeted at any specific document formats. The most notable of these
+ is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+ class that encapsulates all Tika functionality into a single parser that
+ can handle any type of document. This parser will automatically determine
+ the type of the incoming document based on various heuristics and will then
+ parse the document accordingly.
Propchange: tika/site/src/site/apt/0.6/parser.apt
------------------------------------------------------------------------------
svn:eol-style = native
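
The "XHTML SAX events" section of parser.apt above describes content arriving
as SAX events and a handler that keeps only the body part. The following is a
minimal sketch of that idea using only the JDK's SAX classes. It is not Tika's
BodyContentHandler; the class name BodyTextSketch and its bodyText helper are
hypothetical, and the JDK XML parser stands in for a Tika parser as the source
of events.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects character data that occurs inside <body>.
public class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // 0 means we are outside the <body> element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // The default JDK factory is not namespace aware, so match on qName.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Feed an XHTML string through the JDK SAX parser and return body text.
    static String bodyText(String xhtml) {
        BodyTextSketch handler = new BodyTextSketch();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not process events", e);
        }
        return handler.text.toString();
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>Sample</title></head>"
            + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(bodyText(xhtml)); // prints "Hello, Tika"
    }
}
```

Because the handler only ever sees one event at a time, nothing here requires
the whole document to be held in memory, which is the streaming property the
design criteria ask for.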
Modified: tika/site/src/site/apt/0.7/detection.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.7/detection.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.7/detection.apt (original)
+++ tika/site/src/site/apt/0.7/detection.apt Mon Oct 10 22:12:19 2011
@@ -1,152 +1,152 @@
- -----------------
- Content Detection
- -----------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements. See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License. You may obtain a copy of the License at
-~~
-~~ http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Content Detection
-
- This page gives you information on how content and language detection
- works with Apache Tika, and how to tune the behaviour of Tika.
-
-%{toc|section=1|fromDepth=1}
-
-* {The Detector Interface}
-
- The
- {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}}
- interface is the basis for most of the content type detection in Apache
- Tika. The different ways of detecting content all implement the
- same common method:
-
----
-MediaType detect(java.io.InputStream input,
- Metadata metadata) throws java.io.IOException
----
-
- The <<<detect>>> method takes the stream to inspect, and a
- <<<Metadata>>> object that holds any additional information on
- the content. The detector will return a
- {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing
- its best guess as to the type of the file.
-
- In general, only two keys on the Metadata object are used by Detectors.
- These are <<<Metadata.RESOURCE_NAME_KEY>>> which should hold the name
- of the file (where known), and <<<Metadata.CONTENT_TYPE>>> which should
- hold the advertised content type of the file (e.g. from a webserver or
- a content repository).
-
-
-* {Mime Magic Detection}
-
- By looking for special ("magic") patterns of bytes near the start of
- the file, it is often possible to detect the type of the file. For
- some file types, this is a simple process. For others, typically
- container based formats, the magic detection may not be enough. (More
- detail on detecting container formats below)
-
- Tika is able to make use of a mime magic info file, in the
- {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}}
- format to perform mime magic detection.
-
- This is provided within Tika by
- {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via
- {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}},
- normally sourced from the <<<tika-mimetypes.xml>>> file.
-
-
-* {Resource Name Based Detection}
-
- Where the name of the file is known, it is sometimes possible to guess
- the file type from the name or extension. Within the
- <<<tika-mimetypes.xml>>> file is a list of patterns which are used to
- identify the type from the filename.
-
- However, because files may be renamed, this method of detection is quick
- but not always accurate.
-
- This is provided within Tika by
- {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}.
-
-
-* {Known Content Type "Detection"}
-
- Sometimes, the mime type for a file is already known, such as when
- downloading from a webserver, or when retrieving from a content store.
- This information can be used by detectors, such as
- {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
-
-
-* {The default Mime Types Detector}
-
- By default, the mime type detection in Tika is provided by
- {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
- This detector makes use of <<<tika-mimetypes.xml>>> to power
- magic based and filename based detection.
-
- Firstly, magic based detection is used on the start of the file.
- If the file is an XML file, then the start of the XML is processed
- to look for root elements. Next, if available, the filename
- (from <<<Metadata.RESOURCE_NAME_KEY>>>) is
- then used to improve the detail of the detection, such as when magic
- detects a text file, and the filename hints it's really a CSV. Finally,
- if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>)
- is used to further refine the type.
-
-
-* {Container Aware Detection}
-
- Several common file formats are actually held within a common container
- format. One example is the PowerPoint .ppt and Word .doc formats, which
- are both held within an OLE2 container. Another is Apple iWork formats,
- which are actually a series of XML files within a Zip file.
-
- Using magic detection, it is easy to spot that a given file is an OLE2
- document, or a Zip file. Using magic detection alone, it is very difficult
- (and often impossible) to tell what kind of file lives inside the container.
-
- For some use cases, speed is important, so having a quick way to know the
- container type is sufficient. For other cases however, you don't mind
- spending a bit of time (and memory!) processing the container to get a
- more accurate answer on its contents. For these cases, a container
- aware detector should be used.
-
- Tika provides a wrapping detector in the parsers bundle, of
- {{{./api/org/apache/tika/detect/ContainerAwareDetector.html}org.apache.tika.detect.ContainerAwareDetector}}.
- This detector will check for certain known containers, and if found,
- will open them and detect the appropriate type based on the contents.
- If the file isn't a known container, it will fall back to another
- detector for the answer (most commonly the default
- <<<MimeTypes>>> detector).
-
- Because this detector needs to read the whole file to process the
- container, it must be used with a
- {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
- If called with a regular <<<InputStream>>>, then all work will be done
- by the fallback detector.
-
- For more information on container formats and Tika, see
- {{{http://wiki.apache.org/tika/MetadataDiscussion}}}
-
-
-* {Language Detection}
-
- Tika is able to help identify the language of a piece of text, which
- is useful when extracting text from document formats which do not include
- language information in their metadata.
-
- The language detection is provided by
- {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}}.
+ -----------------
+ Content Detection
+ -----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Content Detection
+
+ This page gives you information on how content and language detection
+ works with Apache Tika, and how to tune the behaviour of Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {The Detector Interface}
+
+ The
+ {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}}
+ interface is the basis for most of the content type detection in Apache
+ Tika. The different ways of detecting content all implement the
+ same common method:
+
+---
+MediaType detect(java.io.InputStream input,
+ Metadata metadata) throws java.io.IOException
+---
+
+ The <<<detect>>> method takes the stream to inspect, and a
+ <<<Metadata>>> object that holds any additional information on
+ the content. The detector will return a
+ {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing
+ its best guess as to the type of the file.
+
+ In general, only two keys on the Metadata object are used by Detectors.
+ These are <<<Metadata.RESOURCE_NAME_KEY>>> which should hold the name
+ of the file (where known), and <<<Metadata.CONTENT_TYPE>>> which should
+ hold the advertised content type of the file (e.g. from a webserver or
+ a content repository).
+
+
+* {Mime Magic Detection}
+
+ By looking for special ("magic") patterns of bytes near the start of
+ the file, it is often possible to detect the type of the file. For
+ some file types, this is a simple process. For others, typically
+ container based formats, the magic detection may not be enough (more
+ detail on detecting container formats below).
+
+ Tika is able to make use of a mime magic info file, in the
+ {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}}
+ format, to perform mime magic detection.
+
+ This is provided within Tika by
+ {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via
+ {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}},
+ normally sourced from the <<<tika-mimetypes.xml>>> file.
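
The idea of magic matching can be sketched in a few lines of plain Java. This is a hypothetical illustration only, not Tika's <<<MagicDetector>>>: the class name <<<MagicSketch>>>, the three hard-coded signatures and the returned type strings are assumptions made for the example, whereas Tika reads its patterns from <<<tika-mimetypes.xml>>>.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of magic-byte matching; not Tika's MagicDetector.
class MagicSketch {

    // Well-known leading byte signatures for three formats.
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP = {'P', 'K', 3, 4};
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    /** Peeks at the first bytes of a mark-capable stream, then rewinds it. */
    static String detect(InputStream input) throws IOException {
        if (!input.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] head = new byte[8];
        input.mark(head.length);
        int filled = 0;
        try {
            int n;
            // Keep reading until the buffer is full or the stream ends.
            while (filled < head.length
                    && (n = input.read(head, filled, head.length - filled)) != -1) {
                filled += n;
            }
        } finally {
            input.reset(); // leave the stream where the caller handed it over
        }
        if (startsWith(head, filled, PDF)) return "application/pdf";
        if (startsWith(head, filled, OLE2)) return "application/x-tika-msoffice";
        if (startsWith(head, filled, ZIP)) return "application/zip";
        return "application/octet-stream"; // unknown: generic binary
    }

    private static boolean startsWith(byte[] data, int length, byte[] magic) {
        if (length < magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) return false;
        }
        return true;
    }
}
```

Note that the OLE2 and Zip signatures only identify the container, which is why a .doc and a .ppt file look the same to this kind of check.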
+
+
+* {Resource Name Based Detection}
+
+ Where the name of the file is known, it is sometimes possible to guess
+ the file type from the name or extension. Within the
+ <<<tika-mimetypes.xml>>> file is a list of patterns which are used to
+ identify the type from the filename.
+
+ However, because files may be renamed, this method of detection is quick
+ but not always accurate.
+
+ This is provided within Tika by
+ {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}.
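
A minimal sketch of name based detection, assuming a small hard-coded suffix table in place of the patterns in <<<tika-mimetypes.xml>>>. The class <<<NameSketch>>> and its table are hypothetical and do not reflect how <<<NameDetector>>> is implemented.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of filename based detection; not Tika's NameDetector.
class NameSketch {

    // Assumed suffix table; the real patterns live in tika-mimetypes.xml.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();
    static {
        SUFFIXES.put(".csv", "text/csv");
        SUFFIXES.put(".pdf", "application/pdf");
        SUFFIXES.put(".doc", "application/msword");
    }

    /** Guesses a type from the resource name alone; the content is never read. */
    static String detect(String resourceName) {
        if (resourceName != null) {
            String name = resourceName.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
                if (name.endsWith(entry.getKey())) {
                    return entry.getValue();
                }
            }
        }
        return "application/octet-stream"; // no name, or no matching pattern
    }
}
```

Because only the name is inspected, a PDF renamed to <<<report.csv>>> would be reported as <<<text/csv>>> here, which is the accuracy trade-off described above.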
+
+
+* {Known Content Type "Detection"}
+
+ Sometimes, the mime type for a file is already known, such as when
+ downloading from a webserver, or when retrieving from a content store.
+ This information can be used by detectors, such as
+ {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+
+
+* {The default Mime Types Detector}
+
+ By default, the mime type detection in Tika is provided by
+ {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+ This detector makes use of <<<tika-mimetypes.xml>>> to power
+ magic based and filename based detection.
+
+ Firstly, magic based detection is used on the start of the file.
+ If the file is an XML file, then the start of the XML is processed
+ to look for root elements. Next, if available, the filename
+ (from <<<Metadata.RESOURCE_NAME_KEY>>>) is
+ then used to improve the detail of the detection, such as when magic
+ detects a text file, and the filename hints it's really a CSV. Finally,
+ if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>)
+ is used to further refine the type.
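
The ordering described above can be sketched as follows. This is an assumption-laden illustration rather than the <<<MimeTypes>>> implementation: the class <<<RefineSketch>>>, the tiny parent table and the rule "accept a hint only when it is a more specific form of the current guess" are all choices made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of refining a magic result with hints;
// not the logic of org.apache.tika.mime.MimeTypes.
class RefineSketch {

    // Assumed child -> parent relationships between types.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("text/csv", "text/plain");
        PARENT.put("application/msword", "application/x-tika-msoffice");
    }

    /** True if candidate is a strict descendant of base (or base is unknown). */
    static boolean isSpecialization(String candidate, String base) {
        if (base.equals("application/octet-stream")) {
            return true; // anything is more specific than "unknown binary"
        }
        for (String p = PARENT.get(candidate); p != null; p = PARENT.get(p)) {
            if (p.equals(base)) return true;
        }
        return false;
    }

    /** Starts from the magic result, then lets name and declared type refine it. */
    static String refine(String magicType, String nameType, String declaredType) {
        String type = magicType;
        for (String hint : new String[] {nameType, declaredType}) {
            if (hint != null && isSpecialization(hint, type)) {
                type = hint;
            }
        }
        return type;
    }
}
```

With this rule, a magic result of <<<text/plain>>> combined with a filename hint of <<<text/csv>>> yields <<<text/csv>>>, while a conflicting hint on a file that magic identified as PDF is ignored.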
+
+
+* {Container Aware Detection}
+
+ Several common file formats are actually held within a common container
+ format. One example is the PowerPoint .ppt and Word .doc formats, which
+ are both held within an OLE2 container. Another is Apple iWork formats,
+ which are actually a series of XML files within a Zip file.
+
+ Using magic detection, it is easy to spot that a given file is an OLE2
+ document, or a Zip file. Using magic detection alone, it is very difficult
+ (and often impossible) to tell what kind of file lives inside the container.
+
+ For some use cases, speed is important, so having a quick way to know the
+ container type is sufficient. For other cases however, you don't mind
+ spending a bit of time (and memory!) processing the container to get a
+ more accurate answer on its contents. For these cases, a container
+ aware detector should be used.
+
+ Tika provides a wrapping detector in the parsers bundle, of
+ {{{./api/org/apache/tika/detect/ContainerAwareDetector.html}org.apache.tika.detect.ContainerAwareDetector}}.
+ This detector will check for certain known containers, and if found,
+ will open them and detect the appropriate type based on the contents.
+ If the file isn't a known container, it will fall back to another
+ detector for the answer (most commonly the default
+ <<<MimeTypes>>> detector).
+
+ Because this detector needs to read the whole file to process the
+ container, it must be used with a
+ {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+ If called with a regular <<<InputStream>>>, then all work will be done
+ by the fallback detector.
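
To show why looking inside the container helps, here is a small sketch for the Zip case using only <<<java.util.zip>>>. It is not the <<<ContainerAwareDetector>>>: the class <<<ZipContainerSketch>>> is hypothetical, and the marker entry names and returned type strings are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical illustration of container inspection; not Tika's
// ContainerAwareDetector.
class ZipContainerSketch {

    /**
     * Walks the entries of a Zip stream and reports a more specific type
     * when a known marker entry is present. Consumes and closes the stream.
     */
    static String detect(InputStream zipData) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipData)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                // Assumed marker entries for two Zip based families.
                if (name.equals("[Content_Types].xml")) {
                    return "application/x-tika-ooxml";
                }
                if (name.equals("index.apxl")) {
                    return "application/vnd.apple.keynote";
                }
            }
        }
        return "application/zip"; // a Zip file with no recognised marker
    }
}
```

Unlike a magic check on the first few bytes, this may have to read through the whole archive before it can answer, which is the time and memory cost mentioned above.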
+
+ For more information on container formats and Tika, see
+ {{{http://wiki.apache.org/tika/MetadataDiscussion}}}
+
+
+* {Language Detection}
+
+ Tika is able to help identify the language of a piece of text, which
+ is useful when extracting text from document formats which do not include
+ language information in their metadata.
+
+ The language detection is provided by
+ {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}}.
Propchange: tika/site/src/site/apt/0.7/detection.apt
------------------------------------------------------------------------------
svn:eol-style = native