You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@tika.apache.org by ju...@apache.org on 2008/09/13 13:19:36 UTC

svn commit: r694931 - /incubator/tika/trunk/src/site/apt/documentation.apt

Author: jukka
Date: Sat Sep 13 04:19:35 2008
New Revision: 694931

URL: http://svn.apache.org/viewvc?rev=694931&view=rev
Log:
Documentation, first draft...

Added:
    incubator/tika/trunk/src/site/apt/documentation.apt

Added: incubator/tika/trunk/src/site/apt/documentation.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/documentation.apt?rev=694931&view=auto
==============================================================================
--- incubator/tika/trunk/src/site/apt/documentation.apt (added)
+++ incubator/tika/trunk/src/site/apt/documentation.apt Sat Sep 13 04:19:35 2008
@@ -0,0 +1,195 @@
+                       -------------------------
+                       Apache Tika Documentation
+                       -------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika Documentation
+
+   This document describes the key abstractions and usage of Apache Tika.
+
+The Parser interface
+
+   The <<<org.apache.tika.parser.Parser>>> interface is the key concept
+   of Apache Tika. It hides the complexity of different file formats and
+   parsing libraries while providing a simple and powerful mechanism for
+   client applications to extract structured text content and metadata from
+   all sorts of documents. All this is achieved with a single method:
+
+---
+void parse(
+    InputStream stream, ContentHandler handler, Metadata metadata)
+    throws IOException, SAXException, TikaException;
+---
+
+   The <<<parse>>> method takes the document to be parsed and related metadata
+   as input and outputs the results as XHTML SAX events and extra metadata.
+   The main criteria that lead to this design were:
+
+   [Streamed parsing] The interface should require neither the client
+     application nor the parser implementation to keep the full document
+     content in memory or spooled to disk. This allows even huge documents
+     to be parsed without excessive resource requirements.
+
+   [Structured content] A parser implementation should be able to
+     include structural information (headings, links, etc.) in the extracted
+     content. A client application can use this information for example to
+     better judge the relevance of different parts of the parsed document.
+
+   [Input metadata] A client application should be able to include metadata
+     like the file name or declared content type with the document to be
+     parsed. The parser implementation can use this information to better
+     guide the parsing process.
+
+   [Output metadata] A parser implementation should be able to return
+     document metadata in addition to document content. Many document
+     formats contain metadata like the name of the author that may be useful
+     to client applications.
+
+   These criteria are reflected in the arguments of the <<<parse>>> method.
+
+Document input stream
+
+   The first argument is an
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}input stream}}
+   for reading the document to be parsed.
+
+   If this document stream can not be read, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+   is passed up to the client application. If the stream can be read but
+   not parsed (for example if the document is corrupted), then the parser
+   throws a <<<org.apache.tika.exception.TikaException>>>.
+
+   The parser implementation will consume this stream but <will not close it>.
+   Closing the stream is the responsibility of the client application that
+   opened it in the first place. The recommended pattern for using streams
+   with the <<<parse>>> method is:
+
+---
+InputStream stream = ...;      // open the stream
+try {
+    parser.parse(stream, ...); // parse the stream
+} finally {
+    stream.close();            // close the stream
+}
+---
+
+   Some parser libraries (like {{{http://poi.apache.org/}Apache POI}}) require
+   the input document to be a file on the file system. In such cases the
+   content of the input stream is automatically spooled to a temporary file
+   that gets removed once parsed. A future version of Tika may make it possible
+   to avoid this extra file if the input document is already a file in the
+   local file system. See
+   {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+   of this feature request.
+
+XHTML SAX events
+
+   The parsed content of the document stream is returned to the client
+   application as a sequence of XHTML SAX events. XHTML is used to express
+   structured content of the document and SAX events enable streamed
+   processing. Note that the XHTML format is used here only to convey
+   structural information, not to render the documents for browsing!
+
+   The XHTML SAX events produced by the parser implementation are sent to the
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}content handler}}
+   instance given to the <<<parse>>> method.
+
+   If this the content handler fails to process an event, then parsing stops
+   and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+   is passed up to the client application.
+
+   The overall structure of the generated event stream is (with indenting
+   added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>...</title>
+  </head>
+  <body>
+    ...
+  </body>
+</html>
+---
+
+   Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
+   version 0.2) comes with a number of utility classes that can be used to
+   process and convert the event stream to other representations.
+
+   For example, the <<<org.apache.tika.sax.BodyContentHandler>>> class can be
+   used to extract just the body part of the XHTML output and feed it either
+   as SAX events to another content handler or as characters to an output
+   stream, a writer, or simply a string. The following code snippet parses
+   a document from the standard input stream and outputs the extracted text
+   content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+   Another useful class is <<<org.apache.tika.parser.ParsingReader>>> that
+   uses a background thread to parse the document and returns the extracted
+   text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+    ...;                  // read the document text using the reader
+} finally {
+    reader.close();       // the document stream is closed automatically
+}
+---
+
+Document metadata
+
+   The final argument to the <<<parse>>> method is used to pass document
+   metadata both in and out of the parser. Document metadata is expressed
+   as an <<<org.apache.tika.metadata.Metadata>>> object.
+
+   The following are some of the more interesting metadata properties:
+
+   [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+    the document.
+
+    A client application can set this property to allow the parser to use
+    file name heuristics to determine the format of the document.
+
+    The parser implementation may set this property if the file format
+    contains the canonical name of the file (for example the Gzip format
+    has a slot for the file name).
+
+   [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+    A client application can set this property based on for example a HTTP
+    Content-Type header. The declared content type may help the parser to
+    correctly interpret the document.
+
+    The parser implementation sets this property to the content type according
+    to which the document was parsed.
+
+   [Metadata.TITLE] The title of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit title field.
+
+   [Metadata.AUTHOR] The name of the author of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit author field.