Posted to commits@tika.apache.org by ju...@apache.org on 2008/09/13 13:19:36 UTC
svn commit: r694931 - /incubator/tika/trunk/src/site/apt/documentation.apt
Author: jukka
Date: Sat Sep 13 04:19:35 2008
New Revision: 694931
URL: http://svn.apache.org/viewvc?rev=694931&view=rev
Log:
Documentation, first draft...
Added:
incubator/tika/trunk/src/site/apt/documentation.apt
Added: incubator/tika/trunk/src/site/apt/documentation.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/documentation.apt?rev=694931&view=auto
==============================================================================
--- incubator/tika/trunk/src/site/apt/documentation.apt (added)
+++ incubator/tika/trunk/src/site/apt/documentation.apt Sat Sep 13 04:19:35 2008
@@ -0,0 +1,195 @@
+ -------------------------
+ Apache Tika Documentation
+ -------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements. See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License. You may obtain a copy of the License at
+~~
+~~ http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika Documentation
+
+ This document describes the key abstractions and usage of Apache Tika.
+
+The Parser interface
+
+ The <<<org.apache.tika.parser.Parser>>> interface is the key concept
+ of Apache Tika. It hides the complexity of different file formats and
+ parsing libraries while providing a simple and powerful mechanism for
+ client applications to extract structured text content and metadata from
+ all sorts of documents. All this is achieved with a single method:
+
+---
+void parse(
+ InputStream stream, ContentHandler handler, Metadata metadata)
+ throws IOException, SAXException, TikaException;
+---
+
+ The <<<parse>>> method takes the document to be parsed and related metadata
+ as input and outputs the results as XHTML SAX events and extra metadata.
+ The main criteria that led to this design were:
+
+ [Streamed parsing] The interface should require neither the client
+ application nor the parser implementation to keep the full document
+ content in memory or spooled to disk. This allows even huge documents
+ to be parsed without excessive resource requirements.
+
+ [Structured content] A parser implementation should be able to
+ include structural information (headings, links, etc.) in the extracted
+ content. A client application can use this information, for example, to
+ better judge the relevance of different parts of the parsed document.
+
+ [Input metadata] A client application should be able to include metadata
+ like the file name or declared content type with the document to be
+ parsed. The parser implementation can use this information to better
+ guide the parsing process.
+
+ [Output metadata] A parser implementation should be able to return
+ document metadata in addition to document content. Many document
+ formats contain metadata like the name of the author that may be useful
+ to client applications.
+
+ These criteria are reflected in the arguments of the <<<parse>>> method.
+
+Document input stream
+
+ The first argument is an
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}input stream}}
+ for reading the document to be parsed.
+
+ If this document stream cannot be read, then parsing stops and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+ is passed up to the client application. If the stream can be read but
+ not parsed (for example, if the document is corrupted), then the parser
+ throws a <<<org.apache.tika.exception.TikaException>>>.
+
+ The parser implementation will consume this stream but <will not close it>.
+ Closing the stream is the responsibility of the client application that
+ opened it in the first place. The recommended pattern for using streams
+ with the <<<parse>>> method is:
+
+---
+InputStream stream = ...; // open the stream
+try {
+ parser.parse(stream, ...); // parse the stream
+} finally {
+ stream.close(); // close the stream
+}
+---
+
+ Some parser libraries (like {{{http://poi.apache.org/}Apache POI}}) require
+ the input document to be a file on the file system. In such cases the
+ content of the input stream is automatically spooled to a temporary file
+ that is removed once parsing is complete. A future version of Tika may make it possible
+ to avoid this extra file if the input document is already a file in the
+ local file system. See
+ {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+ of this feature request.
+
+XHTML SAX events
+
+ The parsed content of the document stream is returned to the client
+ application as a sequence of XHTML SAX events. XHTML is used to express the
+ structured content of the document, and SAX events enable streamed
+ processing. Note that the XHTML format is used here only to convey
+ structural information, not to render the documents for browsing!
+
+ The XHTML SAX events produced by the parser implementation are sent to the
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}content handler}}
+ instance given to the <<<parse>>> method.
+
+ If the content handler fails to process an event, then parsing stops
+ and the thrown
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+ is passed up to the client application.
+
+ The overall structure of the generated event stream is (with indenting
+ added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+ <head>
+ <title>...</title>
+ </head>
+ <body>
+ ...
+ </body>
+</html>
+---
+
+ Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
+ version 0.2) comes with a number of utility classes that can be used to
+ process and convert the event stream to other representations.
+
+ For example, the <<<org.apache.tika.sax.BodyContentHandler>>> class can be
+ used to extract just the body part of the XHTML output and feed it either
+ as SAX events to another content handler or as characters to an output
+ stream, a writer, or simply a string. The following code snippet parses
+ a document from the standard input stream and outputs the extracted text
+ content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+ Another useful class is <<<org.apache.tika.parser.ParsingReader>>>, which
+ uses a background thread to parse the document and returns the extracted
+ text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+ ...; // read the document text using the reader
+} finally {
+ reader.close(); // the document stream is closed automatically
+}
+---
+
+Document metadata
+
+ The final argument to the <<<parse>>> method is used to pass document
+ metadata both in and out of the parser. Document metadata is expressed
+ as an <<<org.apache.tika.metadata.Metadata>>> object.
+
+ The following are some of the more interesting metadata properties:
+
+ [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+ the document.
+
+ A client application can set this property to allow the parser to use
+ file name heuristics to determine the format of the document.
+
+ The parser implementation may set this property if the file format
+ contains the canonical name of the file (for example, the Gzip format
+ has a slot for the file name).
+
+ [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+ A client application can set this property based on, for example, an HTTP
+ Content-Type header. The declared content type may help the parser to
+ correctly interpret the document.
+
+ The parser implementation sets this property to the content type according
+ to which the document was parsed.
+
+ [Metadata.TITLE] The title of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit title field.
+
+ [Metadata.AUTHOR] The name of the author of the document.
+
+ The parser implementation sets this property if the document format
+ contains an explicit author field.