Posted to commits@tika.apache.org by ju...@apache.org on 2008/09/27 23:01:07 UTC

svn commit: r699736 - /incubator/tika/trunk/src/site/apt/documentation.apt

Author: jukka
Date: Sat Sep 27 14:01:07 2008
New Revision: 699736

URL: http://svn.apache.org/viewvc?rev=699736&view=rev
Log:
Improved/extended documentation

Modified:
    incubator/tika/trunk/src/site/apt/documentation.apt

Modified: incubator/tika/trunk/src/site/apt/documentation.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/documentation.apt?rev=699736&r1=699735&r2=699736&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/apt/documentation.apt (original)
+++ incubator/tika/trunk/src/site/apt/documentation.apt Sat Sep 27 14:01:07 2008
@@ -23,15 +23,15 @@
 
 The Parser interface
 
-   The <<<org.apache.tika.parser.Parser>>> interface is the key concept
-   of Apache Tika. It hides the complexity of different file formats and
-   parsing libraries while providing a simple and powerful mechanism for
-   client applications to extract structured text content and metadata from
-   all sorts of documents. All this is achieved with a single method:
+   The {{{apidocs/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+   interface is the key concept of Apache Tika. It hides the complexity of
+   different file formats and parsing libraries while providing a simple and
+   powerful mechanism for client applications to extract structured text
+   content and metadata from all sorts of documents. All this is achieved
+   with a single method:
 
 ---
-void parse(
-    InputStream stream, ContentHandler handler, Metadata metadata)
+void parse(InputStream stream, ContentHandler handler, Metadata metadata)
     throws IOException, SAXException, TikaException;
 ---
 
@@ -59,19 +59,21 @@
      formats contain metadata like the name of the author that may be useful
      to client applications.
 
+   []
+
    These criteria are reflected in the arguments of the <<<parse>>> method.
 
 Document input stream
 
    The first argument is an
-   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}input stream}}
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
    for reading the document to be parsed.
 
    If this document stream cannot be read, then parsing stops and the thrown
    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
    is passed up to the client application. If the stream can be read but
    not parsed (for example if the document is corrupted), then the parser
-   throws a <<<org.apache.tika.exception.TikaException>>>.
+   throws a {{{apidocs/org/apache/tika/exception/TikaException.html}TikaException}}.
 
    The parser implementation will consume this stream but <will not close it>.
    Closing the stream is the responsibility of the client application that
@@ -87,8 +89,8 @@
 }
 ---
 
-   Some parser libraries (like {{{http://poi.apache.org/}Apache POI}}) require
-   the input document to be a file on the file system. In such cases the
+   Some document formats like the OLE2 Compound Document Format used by
+   Microsoft Office are best parsed as random access files. In such cases the
    content of the input stream is automatically spooled to a temporary file
    that gets removed once parsed. A future version of Tika may make it possible
    to avoid this extra file if the input document is already a file in the
@@ -104,8 +106,8 @@
    processing. Note that the XHTML format is used here only to convey
    structural information, not to render the documents for browsing!
 
-   The XHTML SAX events produced by the parser implementation are sent to the
-   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}content handler}}
+   The XHTML SAX events produced by the parser implementation are sent to a
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
    instance given to the <<<parse>>> method.
 
    If the content handler fails to process an event, then parsing stops
@@ -127,23 +129,29 @@
 </html>
 ---
 
+   Parser implementations typically use the
+   {{{apidocs/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+   utility class to generate the XHTML output.
+
    Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
    version 0.2) comes with a number of utility classes that can be used to
    process and convert the event stream to other representations.
 
-   For example, the <<<org.apache.tika.sax.BodyContentHandler>>> class can be
-   used to extract just the body part of the XHTML output and feed it either
-   as SAX events to another content handler or as characters to an output
-   stream, a writer, or simply a string. The following code snippet parses
-   a document from the standard input stream and outputs the extracted text
-   content to standard output:
+   For example, the
+   {{{apidocs/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   class can be used to extract just the body part of the XHTML output and
+   feed it either as SAX events to another content handler or as characters
+   to an output stream, a writer, or simply a string. The following code
+   snippet parses a document from the standard input stream and outputs the
+   extracted text content to standard output:
 
 ---
 ContentHandler handler = new BodyContentHandler(System.out);
 parser.parse(System.in, handler, ...);
 ---
 
-   Another useful class is <<<org.apache.tika.parser.ParsingReader>>> that
+   Another useful class is
+   {{{apidocs/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
    uses a background thread to parse the document and returns the extracted
    text content as a character stream:
 
@@ -161,7 +169,7 @@
 
    The final argument to the <<<parse>>> method is used to pass document
    metadata both in and out of the parser. Document metadata is expressed
-   as an <<<org.apache.tika.metadata.Metadata>>> object.
+   as an {{{apidocs/org/apache/tika/metadata/Metadata.html}Metadata}} object.
 
    The following are some of the more interesting metadata properties:
 
@@ -193,3 +201,29 @@
 
     The parser implementation sets this property if the document format
     contains an explicit author field.
+
+   []
+
+   Note that metadata handling is still being discussed by the Tika development
+   team, and it is likely that there will be some (backwards incompatible)
+   changes in metadata handling before Tika 1.0.
+
+Parser implementations
+
+   Apache Tika comes with a number of parser classes for parsing
+   {{{formats.html}various document formats}}. You can also extend Tika
+   with your own parsers, and of course any contributions to Tika are
+   warmly welcome.
+
+   The goal of Tika is to reuse existing parser libraries like
+   {{{http://www.pdfbox.org/}PDFBox}} or
+   {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+   of the parser classes in Tika are adapters to such external libraries.
+
+   Tika also contains some general purpose parser implementations that are
+   not targeted at any specific document formats. The most notable of these
+   is the {{{apidocs/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+   class that encapsulates all Tika functionality into a single parser that
+   can handle any type of document. This parser will automatically determine
+   the type of the incoming document based on various heuristics and will then
+   parse the document accordingly.
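
Editorial sketch (not part of the commit): the documentation changed above describes feeding XHTML SAX events to a ContentHandler and keeping only the body text, which is the role it assigns to BodyContentHandler. The following is a minimal illustration of that idea using only the JDK's SAX classes. The class name BodyTextSketch and its extractBody method are hypothetical and are not Tika APIs; the real BodyContentHandler may work differently.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for illustration only; this is NOT Tika's BodyContentHandler.
class BodyTextSketch extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // greater than zero while inside <body>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Start counting at <body>, and keep counting for every nested element.
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collect character data only while inside the body element.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Parses an XHTML string with the JDK's default (non-namespace-aware) SAX parser
    // and returns the text found inside <body>.
    static String extractBody(String xhtml) throws Exception {
        BodyTextSketch handler = new BodyTextSketch();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>T</title></head>"
                + "<body><p>Hello</p></body></html>";
        System.out.println(extractBody(xhtml));
    }
}
```

The sketch matches on qName because the JDK's default SAXParserFactory is not namespace-aware. A handler receiving namespace-aware events would need to compare localName instead.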