Posted to commits@tika.apache.org by ju...@apache.org on 2011/10/11 00:12:20 UTC

svn commit: r1181271 [1/3] - in /tika/site/src/site/apt: ./ 0.10/ 0.5/ 0.6/ 0.7/ 0.8/ 0.9/ 1.0/

Author: jukka
Date: Mon Oct 10 22:12:19 2011
New Revision: 1181271

URL: http://svn.apache.org/viewvc?rev=1181271&view=rev
Log:
site: Add svn:eol-style settings

Modified:
    tika/site/src/site/apt/0.10/detection.apt   (props changed)
    tika/site/src/site/apt/0.10/formats.apt   (props changed)
    tika/site/src/site/apt/0.10/gettingstarted.apt   (props changed)
    tika/site/src/site/apt/0.10/index.apt   (props changed)
    tika/site/src/site/apt/0.10/parser.apt   (props changed)
    tika/site/src/site/apt/0.10/parser_guide.apt   (props changed)
    tika/site/src/site/apt/0.5/documentation.apt   (contents, props changed)
    tika/site/src/site/apt/0.5/formats.apt   (contents, props changed)
    tika/site/src/site/apt/0.5/gettingstarted.apt   (contents, props changed)
    tika/site/src/site/apt/0.5/index.apt   (contents, props changed)
    tika/site/src/site/apt/0.6/formats.apt   (contents, props changed)
    tika/site/src/site/apt/0.6/gettingstarted.apt   (contents, props changed)
    tika/site/src/site/apt/0.6/index.apt   (contents, props changed)
    tika/site/src/site/apt/0.6/parser.apt   (contents, props changed)
    tika/site/src/site/apt/0.7/detection.apt   (contents, props changed)
    tika/site/src/site/apt/0.7/formats.apt   (contents, props changed)
    tika/site/src/site/apt/0.7/gettingstarted.apt   (contents, props changed)
    tika/site/src/site/apt/0.7/index.apt   (contents, props changed)
    tika/site/src/site/apt/0.7/parser.apt   (contents, props changed)
    tika/site/src/site/apt/0.7/parser_guide.apt   (contents, props changed)
    tika/site/src/site/apt/0.8/detection.apt   (props changed)
    tika/site/src/site/apt/0.8/formats.apt   (props changed)
    tika/site/src/site/apt/0.8/gettingstarted.apt   (props changed)
    tika/site/src/site/apt/0.8/index.apt   (props changed)
    tika/site/src/site/apt/0.8/parser.apt   (props changed)
    tika/site/src/site/apt/0.8/parser_guide.apt   (props changed)
    tika/site/src/site/apt/0.9/detection.apt   (props changed)
    tika/site/src/site/apt/0.9/formats.apt   (props changed)
    tika/site/src/site/apt/0.9/gettingstarted.apt   (props changed)
    tika/site/src/site/apt/0.9/index.apt   (props changed)
    tika/site/src/site/apt/0.9/parser.apt   (props changed)
    tika/site/src/site/apt/0.9/parser_guide.apt   (props changed)
    tika/site/src/site/apt/1.0/parser_guide.apt   (props changed)
    tika/site/src/site/apt/download.apt   (contents, props changed)

Propchange: tika/site/src/site/apt/0.10/detection.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Propchange: tika/site/src/site/apt/0.10/formats.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Propchange: tika/site/src/site/apt/0.10/gettingstarted.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Propchange: tika/site/src/site/apt/0.10/index.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Propchange: tika/site/src/site/apt/0.10/parser.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Propchange: tika/site/src/site/apt/0.10/parser_guide.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: tika/site/src/site/apt/0.5/documentation.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/documentation.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/documentation.apt (original)
+++ tika/site/src/site/apt/0.5/documentation.apt Mon Oct 10 22:12:19 2011
@@ -1,227 +1,227 @@
-                       -------------------------
-                       Apache Tika Documentation
-                       -------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements.  See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License.  You may obtain a copy of the License at
-~~
-~~     http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Apache Tika Documentation
-
-   This document describes the key abstractions and usage of Apache Tika.
-
-The Parser interface
-
-   The {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
-   interface is the key concept of Apache Tika. It hides the complexity of
-   different file formats and parsing libraries while providing a simple and
-   powerful mechanism for client applications to extract structured text
-   content and metadata from all sorts of documents. All this is achieved
-   with a single method:
-
----
-void parse(InputStream stream, ContentHandler handler, Metadata metadata)
-    throws IOException, SAXException, TikaException;
----
-
-   The <<<parse>>> method takes the document to be parsed and related metadata
-   as input and outputs the results as XHTML SAX events and extra metadata.
-   The main criteria that lead to this design were:
-
-   [Streamed parsing] The interface should require neither the client
-     application nor the parser implementation to keep the full document
-     content in memory or spooled to disk. This allows even huge documents
-     to be parsed without excessive resource requirements.
-
-   [Structured content] A parser implementation should be able to
-     include structural information (headings, links, etc.) in the extracted
-     content. A client application can use this information for example to
-     better judge the relevance of different parts of the parsed document.
-
-   [Input metadata] A client application should be able to include metadata
-     like the file name or declared content type with the document to be
-     parsed. The parser implementation can use this information to better
-     guide the parsing process.
-
-   [Output metadata] A parser implementation should be able to return
-     document metadata in addition to document content. Many document
-     formats contain metadata like the name of the author that may be useful
-     to client applications.
-
-   []
-
-   These criteria are reflected in the arguments of the <<<parse>>> method.
-
-Document input stream
-
-   The first argument is an
-   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
-   for reading the document to be parsed.
-
-   If this document stream can not be read, then parsing stops and the thrown
-   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
-   is passed up to the client application. If the stream can be read but
-   not parsed (for example if the document is corrupted), then the parser
-   throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
-
-   The parser implementation will consume this stream but <will not close it>.
-   Closing the stream is the responsibility of the client application that
-   opened it in the first place. The recommended pattern for using streams
-   with the <<<parse>>> method is:
-
----
-InputStream stream = ...;      // open the stream
-try {
-    parser.parse(stream, ...); // parse the stream
-} finally {
-    stream.close();            // close the stream
-}
----
-
-   Some document formats like the OLE2 Compound Document Format used by
-   Microsoft Office are best parsed as random access files. In such cases the
-   content of the input stream is automatically spooled to a temporary file
-   that gets removed once parsed. A future version of Tika may make it possible
-   to avoid this extra file if the input document is already a file in the
-   local file system. See
-   {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
-   of this feature request.
-
-XHTML SAX events
-
-   The parsed content of the document stream is returned to the client
-   application as a sequence of XHTML SAX events. XHTML is used to express
-   structured content of the document and SAX events enable streamed
-   processing. Note that the XHTML format is used here only to convey
-   structural information, not to render the documents for browsing!
-
-   The XHTML SAX events produced by the parser implementation are sent to a
-   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
-   instance given to the <<<parse>>> method. If this the content handler
-   fails to process an event, then parsing stops and the thrown
-   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
-   is passed up to the client application.
-
-   The overall structure of the generated event stream is (with indenting
-   added for clarity):
-
----
-<html xmlns="http://www.w3.org/1999/xhtml">
-  <head>
-    <title>...</title>
-  </head>
-  <body>
-    ...
-  </body>
-</html>
----
-
-   Parser implementations typically use the
-   {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
-   utility class to generate the XHTML output.
-
-   Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
-   version 0.2) comes with a number of utility classes that can be used to
-   process and convert the event stream to other representations.
-
-   For example, the
-   {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
-   class can be used to extract just the body part of the XHTML output and
-   feed it either as SAX events to another content handler or as characters
-   to an output stream, a writer, or simply a string. The following code
-   snippet parses a document from the standard input stream and outputs the
-   extracted text content to standard output:
-
----
-ContentHandler handler = new BodyContentHandler(System.out);
-parser.parse(System.in, handler, ...);
----
-
-   Another useful class is
-   {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
-   uses a background thread to parse the document and returns the extracted
-   text content as a character stream:
-
----
-InputStream stream = ...; // the document to be parsed
-Reader reader = new ParsingReader(parser, stream, ...);
-try {
-    ...;                  // read the document text using the reader
-} finally {
-    reader.close();       // the document stream is closed automatically
-}
----
-
-Document metadata
-
-   The final argument to the <<<parse>>> method is used to pass document
-   metadata both in and out of the parser. Document metadata is expressed
-   as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
-
-   The following are some of the more interesting metadata properties:
-
-   [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
-    the document.
-
-    A client application can set this property to allow the parser to use
-    file name heuristics to determine the format of the document.
-
-    The parser implementation may set this property if the file format
-    contains the canonical name of the file (for example the Gzip format
-    has a slot for the file name).
-
-   [Metadata.CONTENT_TYPE] The declared content type of the document.
-
-    A client application can set this property based on for example a HTTP
-    Content-Type header. The declared content type may help the parser to
-    correctly interpret the document.
-
-    The parser implementation sets this property to the content type according
-    to which the document was parsed.
-
-   [Metadata.TITLE] The title of the document.
-
-    The parser implementation sets this property if the document format
-    contains an explicit title field.
-
-   [Metadata.AUTHOR] The name of the author of the document.
-
-    The parser implementation sets this property if the document format
-    contains an explicit author field.
-
-   []
-
-   Note that metadata handling is still being discussed by the Tika development
-   team, and it is likely that there will be some (backwards incompatible)
-   changes in metadata handling before Tika 1.0.
-
-Parser implementations
-
-   Apache Tika comes with a number of parser classes for parsing
-   {{{./formats.html}various document formats}}. You can also extend Tika
-   with your own parsers, and of course any contributions to Tika are
-   warmly welcome.
-
-   The goal of Tika is to reuse existing parser libraries like
-   {{{http://www.pdfbox.org/}PDFBox}} or
-   {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
-   of the parser classes in Tika are adapters to such external libraries.
-
-   Tika also contains some general purpose parser implementations that are
-   not targeted at any specific document formats. The most notable of these
-   is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
-   class that encapsulates all Tika functionality into a single parser that
-   can handle any types of documents. This parser will automatically determine
-   the type of the incoming document based on various heuristics and will then
-   parse the document accordingly.
+                       -------------------------
+                       Apache Tika Documentation
+                       -------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika Documentation
+
+   This document describes the key abstractions and usage of Apache Tika.
+
+The Parser interface
+
+   The {{{./api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+   interface is the key concept of Apache Tika. It hides the complexity of
+   different file formats and parsing libraries while providing a simple and
+   powerful mechanism for client applications to extract structured text
+   content and metadata from all sorts of documents. All this is achieved
+   with a single method:
+
+---
+void parse(InputStream stream, ContentHandler handler, Metadata metadata)
+    throws IOException, SAXException, TikaException;
+---
+
+   The <<<parse>>> method takes the document to be parsed and related metadata
+   as input and outputs the results as XHTML SAX events and extra metadata.
+   The main criteria that lead to this design were:
+
+   [Streamed parsing] The interface should require neither the client
+     application nor the parser implementation to keep the full document
+     content in memory or spooled to disk. This allows even huge documents
+     to be parsed without excessive resource requirements.
+
+   [Structured content] A parser implementation should be able to
+     include structural information (headings, links, etc.) in the extracted
+     content. A client application can use this information for example to
+     better judge the relevance of different parts of the parsed document.
+
+   [Input metadata] A client application should be able to include metadata
+     like the file name or declared content type with the document to be
+     parsed. The parser implementation can use this information to better
+     guide the parsing process.
+
+   [Output metadata] A parser implementation should be able to return
+     document metadata in addition to document content. Many document
+     formats contain metadata like the name of the author that may be useful
+     to client applications.
+
+   []
+
+   These criteria are reflected in the arguments of the <<<parse>>> method.
+
+Document input stream
+
+   The first argument is an
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
+   for reading the document to be parsed.
+
+   If this document stream can not be read, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+   is passed up to the client application. If the stream can be read but
+   not parsed (for example if the document is corrupted), then the parser
+   throws a {{{./api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+   The parser implementation will consume this stream but <will not close it>.
+   Closing the stream is the responsibility of the client application that
+   opened it in the first place. The recommended pattern for using streams
+   with the <<<parse>>> method is:
+
+---
+InputStream stream = ...;      // open the stream
+try {
+    parser.parse(stream, ...); // parse the stream
+} finally {
+    stream.close();            // close the stream
+}
+---
+
+   Some document formats like the OLE2 Compound Document Format used by
+   Microsoft Office are best parsed as random access files. In such cases the
+   content of the input stream is automatically spooled to a temporary file
+   that gets removed once parsed. A future version of Tika may make it possible
+   to avoid this extra file if the input document is already a file in the
+   local file system. See
+   {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+   of this feature request.
+
+XHTML SAX events
+
+   The parsed content of the document stream is returned to the client
+   application as a sequence of XHTML SAX events. XHTML is used to express
+   structured content of the document and SAX events enable streamed
+   processing. Note that the XHTML format is used here only to convey
+   structural information, not to render the documents for browsing!
+
+   The XHTML SAX events produced by the parser implementation are sent to a
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   instance given to the <<<parse>>> method. If this the content handler
+   fails to process an event, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+   is passed up to the client application.
+
+   The overall structure of the generated event stream is (with indenting
+   added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>...</title>
+  </head>
+  <body>
+    ...
+  </body>
+</html>
+---
+
+   Parser implementations typically use the
+   {{{./api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+   utility class to generate the XHTML output.
+
+   Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
+   version 0.2) comes with a number of utility classes that can be used to
+   process and convert the event stream to other representations.
+
+   For example, the
+   {{{./api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   class can be used to extract just the body part of the XHTML output and
+   feed it either as SAX events to another content handler or as characters
+   to an output stream, a writer, or simply a string. The following code
+   snippet parses a document from the standard input stream and outputs the
+   extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+   Another useful class is
+   {{{./api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
+   uses a background thread to parse the document and returns the extracted
+   text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+    ...;                  // read the document text using the reader
+} finally {
+    reader.close();       // the document stream is closed automatically
+}
+---
+
+Document metadata
+
+   The final argument to the <<<parse>>> method is used to pass document
+   metadata both in and out of the parser. Document metadata is expressed
+   as an {{{./api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+   The following are some of the more interesting metadata properties:
+
+   [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+    the document.
+
+    A client application can set this property to allow the parser to use
+    file name heuristics to determine the format of the document.
+
+    The parser implementation may set this property if the file format
+    contains the canonical name of the file (for example the Gzip format
+    has a slot for the file name).
+
+   [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+    A client application can set this property based on for example a HTTP
+    Content-Type header. The declared content type may help the parser to
+    correctly interpret the document.
+
+    The parser implementation sets this property to the content type according
+    to which the document was parsed.
+
+   [Metadata.TITLE] The title of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit title field.
+
+   [Metadata.AUTHOR] The name of the author of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit author field.
+
+   []
+
+   Note that metadata handling is still being discussed by the Tika development
+   team, and it is likely that there will be some (backwards incompatible)
+   changes in metadata handling before Tika 1.0.
+
+Parser implementations
+
+   Apache Tika comes with a number of parser classes for parsing
+   {{{./formats.html}various document formats}}. You can also extend Tika
+   with your own parsers, and of course any contributions to Tika are
+   warmly welcome.
+
+   The goal of Tika is to reuse existing parser libraries like
+   {{{http://www.pdfbox.org/}PDFBox}} or
+   {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+   of the parser classes in Tika are adapters to such external libraries.
+
+   Tika also contains some general purpose parser implementations that are
+   not targeted at any specific document formats. The most notable of these
+   is the {{{./api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+   class that encapsulates all Tika functionality into a single parser that
+   can handle any types of documents. This parser will automatically determine
+   the type of the incoming document based on various heuristics and will then
+   parse the document accordingly.

Propchange: tika/site/src/site/apt/0.5/documentation.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: tika/site/src/site/apt/0.5/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/formats.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/formats.apt (original)
+++ tika/site/src/site/apt/0.5/formats.apt Mon Oct 10 22:12:19 2011
@@ -1,303 +1,303 @@
-                       --------------------------
-                       Supported Document Formats
-                       --------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements.  See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License.  You may obtain a copy of the License at
-~~
-~~     http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Supported Document Formats
-
-   This page lists all the document formats supported by Apache Tika.
-
-* Microsoft's OLE 2 Compound Document format
-
-   A number of Microsoft applications, most notably the Microsoft Office
-   suite, use the generic OLE 2 Compound Document format as the basis of
-   their document formats. Tika uses {{{http://poi.apache.org/}Apache POI}}
-   to support a number of these formats.
-
-   The OLE2 Compound Document format is designed for use with random access
-   files, and so the input stream passed to a Tika parser needs to be spooled
-   in memory or in a temporary file depending on the size of the document.
-   See {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for an
-   effort to avoid this extra temporary file if the input document already
-   comes from a file.
-
-   In addition to the shared base format there's also a shared sets of
-   metadata in typical OLE2 documents. Tika uses the
-   {{{http://poi.apache.org/hpsf/}HPSF library}} from POI to parse these
-   property sets and exposes them as the following document metadata:
-
-      * <<<TITLE>>> Title
-
-      * <<<SUBJECT>>> Subject
-
-      * <<<AUTHOR>>> Author
-
-      * <<<KEYWORDS>>> Keywords
-
-      * <<<COMMENTS>>> Comments
-
-      * <<<TEMPLATE>>> Template
-
-      * <<<LAST_SAVED>>> Last Saved By
-
-      * <<<REVISION_NUMBER>>> Revision Number
-
-      * <<<LAST_PRINTED>>> Last Printed
-
-      * <<<LAST_SAVED>>> Last Saved Time/Date
-
-      * <<<LAST_SAVED>>> Last Saved Time/Date
-
-      * <<<PAGE_COUNT>>> Number of Pages
-
-      * <<<WORD_COUNT>>> Number of Words
-
-      * <<<CHARACTER_COUNT>>> Number of Characters
-
-      * <<<APPLICATION_NAME>>> Name of Creating Application
-
-   Note that in practice the metadata in many documents is either missing,
-   incomplete or even incorrect, so a client application should not rely
-   too much on this information.
-
-   Support for the new Office Open XML format used by Microsoft Office
-   version 2007 is pending for a POI upgrade. Current status is recorded in
-   {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}.
-
-   The generic OLE2 Compound Document format is automatically detected using
-   a magic number, and further parsing can automatically determine the more
-   specific document format. Tika also knows a number of common glob patterns
-   like <<<*.doc>>> and <<<*.ppt>>> for these formats.
-
-   The supported OLE 2 Compound Document formats are:
-
-   [Microsoft Excel (application/vnd.ms-excel)]
-    Excel spreadsheet support is available in all versions of Tika and is
-    based on the {{{http://poi.apache.org/hssf/}HSSF library}} from POI.
-
-    The Excel parser in Tika uses the
-    {{{http://poi.apache.org/hssf/how-to.html#event_api}HSSF event API}} and
-    is able to extract much of the document structure, including all
-    (non-empty) worksheets and their table structures. Formula results are
-    extracted as stored in the Excel file, and cell links are exposed as
-    XHTML links. These features were added in Tika version 0.2.
-
-    Cell comments and formatting are currently not supported. See
-    {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
-    {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
-    respective issues.
-
-   [Microsoft Word (application/msword)]
-    Word document support is available in all versions of Tika and is based
-    on the {{{http://poi.apache.org/hwpf/}HWPF library}} from POI.
-
-    The Word parser uses the
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
-    class from HWPF to extract document content as a sequence of paragraphs.
-
-   [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
-    PowerPoint presentation support is available in all versions of Tika and
-    is based on the {{{http://poi.apache.org/hslf/}HSLF library}} from POI.
-
-    The PowerPoint parser uses the
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
-    class from HSLF to extract spreadsheet content as a single paragraph.
-
-   [Microsoft Visio (application/vnd.visio)]
-    Visio diagram support was added in Tika version 0.2 and is based on the
-    {{{http://poi.apache.org/hdgf/}HDGF library}} from POI.
-
-    The Visio parser uses the
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioExtractor}}
-    class from HDGF to extract diagram content as a sequence of paragraphs.
-
-   [Microsoft Outlook (application/vnd.ms-outlook)]
-    Outlook message support was added in Tika version 0.2 and is based on the
-    {{{http://poi.apache.org/hsmf/}HSMF library}} from POI.
-
-    The Outlook parser extracts the subject of the message and the From,
-    To, Cc, and Bcc addresses (formatted for display) along with the body
-    text of text/plain messages. The <<<AUTHOR>>>, <<<TITLE>>> and
-    <<<SUBJECT>>> metadata properties are set explicitly, overriding
-    potential generic document metadata retrieved from OLE2 property sets.
-
-* Compression formats
-
-   General purpose compression formats are used to reduce the size of
-   any kinds of documents. Tika uses a parsing pipeline to support general
-   purpose compression: in the first stage the compressed stream decompressed
-   and the resulting decompressed stream is passed on to a second parsing
-   stage where it will be processed as if the document had never been
-   compressed.
-
-   Tika contains magic numbers and glob patterns for auto-detecting all
-   supported compression formats. The glob patterns of compression formats
-   are also used to determine the name of the original uncompressed document.
-   If a client application has supplied a <<<RESOURCE_NAME_KEY>>> metadata
-   property that matches such a glob pattern, then the decompressing first
-   parsing stage will replace the <<<RESOURCE_NAME_KEY>>> metadata property
-   with the deduced original document name before passing control to the
-   second parsing stage.
-
-   Note that apart from the special handling of the <<<RESOURCE_NAME_KEY>>>
-   property, no document metadata is passed to or from the second parsing
-   stage. Only the text content extracted by the second stage parser is
-   returned to the client application.
-
-   The supported compression formats are:
-
-   [gzip compression (application/x-gzip)]
-    {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
-    Tika version 0.2 and is based on the
-    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
-    class in the Java 5 class library.
-
-    The known gzip glob patterns are <<<*.tgz>>>, <<<*.gz>>> and <<<*-gz>>>,
-    and they will respectively be replaced with <<<*.tar>>>, <<<*>>> and
-    <<<*>>> as described above.
-
-   [bzip2 compression (application/x-bzip)]
-    {{{http://en.wikipedia.org/wiki/Bzip2}Bzip2}} support was added in
-    Tika version 0.2 and is based on bzip2 parsing code from
-    {{{http://ant.apache.org/}Apache Ant}}, which in turn was originally
-    based on work by Keiron Liddle from Aftex Software.
-
-    The known bzip2 glob patterns are <<<*.tbz>>>, <<<*.tbz2>>>, <<<*.bz>>>
-    and <<<*.bz2>>>, and they will respectively be replaced with <<<*.tar>>>,
-    <<<*.tar>>>, <<<*>>> and <<<*>>> as described above.
-
-* Audio formats
-
-   Tika can detect several common audio formats and extract metadata
-   from them. Text extraction is supported for some MIDI-based karaoke
-   formats that contain the lyrics of the encoded audio.
-
-   See {{{https://issues.apache.org/jira/browse/TIKA-94}TIKA-94}} for
-   an effort to integrate speech recognition support into Tika.
-
-   [MP3 Audio (audio/mpeg)]
-    The parsing of {{{http://www.id3.org/ID3v1}ID3v1}} tags from MP3 files
-    was added in Tika version 0.2. If found, the following metadata is
-    extracted and set:
-
-      * <<<TITLE>>> Title
-
-      * <<<SUBJECT>>> Subject
-
-    The above information, as well as the <<<Album>>>, <<<Track>>>,
-    <<<Year>>>, <<<Genre>>> and additional <<<Comment>>> are extracted
-    when set in the file.
-
-   [MIDI audio (audio/midi)]
-    Tika uses the MIDI support in <<<javax.sound.midi>>> to parse MIDI
-    sequence files. Many karaoke file formats are based on MIDI, and
-    contain lyrics as embedded text tracks that Tika knows how to extract.
-
-    Support for MIDI files was added in Tika 0.3.
-
-   [Wave audio (audio/basic)]
-    Tika supports sampled wave audio (.wav files, etc.) using the
-    <<<javax.sound.sampled>>> package. Only sampling metadata is extracted.
-
-    Support for sampled wave audio was added in Tika 0.3. 
-
-* Other supported formats
-
-   [Extensible Markup Language (application/xml)]
-    Tika uses the <<<javax.xml>>> classes to parse Extensible Markup Language files.
-    Support for Extensible Markup Language files was added in Tika 0.1.
-
-   [HyperText Markup Language (text/html)]
-    Tika uses the {{{http://sourceforge.net/projects/nekohtml}CyberNeko}} library to parse HyperText Markup Language files.
-    Support for HyperText Markup Language files was added in Tika 0.1.
-
-   [Images (image/*)]
-    Tika uses the <<<javax.imageio>>> classes to extract metadata
-    from image files.
-
-    Support for Image files was added in Tika 0.2.
-
-   [Java class files]
-    The parsing of Java Class files is based on the asm library and
-    work by Dave Brosius in JCR-1522.
-
-    Support for Java Class files was added in Tika 0.2.
-
-   [Java jar archives]
-    The parsing of Java JAR archives is performed using a combination of
-    the ZIP and Java class file parsers.
-
-    Support for Java JAR archives was added in Tika 0.2.
-
-   [OpenDocument (application/vnd.oasis.opendocument.*)]
-    Tika uses the built-in ZIP and XML features in Java to parse the
-    {{{http://en.wikipedia.org/wiki/OpenDocument}OpenDocument}} document types
-    used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0
-    formats are also supported, though they are currently not auto-detected
-    as reliably as the newer formats.
-
-    Support for the OpenDocument formats was added in Tika 0.3.
-
-   [Plain text (text/plain)]
-    Tika uses the
-    {{{http://www.icu-project.org/}International Components for Unicode}}
-    Java library (ICU4J) to parse plain text. Support for plain text was added
-    in Tika 0.1.
-
-    Extracting text content from plain text files is actually a relatively
-    complex task due to the fact that the character encoding of the text
-    file is often unknown to the parser.
-
-    The text parser in Tika uses the ICU4J
-    {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
-    class to automatically detect the character encoding of any text input.
-    As an added benefit, the ICU4J library is in some cases also able to
-    detect the language in which the text is written.
-
-    The character encoding and language of the plain text document are
-    returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
-    metadata properties. If the (declared) content encoding of a text document
-    is already known to the client application, then it can be supplied as the
-    <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
-    simplify encoding detection.
-
-   [Portable Document Format (application/pdf)]
-    Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
-    Portable Document Format (PDF) documents.
-
-    Support for PDF was added in Tika 0.1.
-
-   [Rich Text Format (application/rtf)]
-    Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
-    documents. Support for RTF was added in Tika 0.1.
-
-    The RTF parser in Tika uses the Swing
-    {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
-    class to extract all text from an RTF document as a single paragraph.
-    Document metadata extraction is currently not supported.
-
-   [tar archive (application/x-tar)]
-    Tika uses an adapted version of the tar parsing code from
-    {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
-    The tar code is originally based on work by Timothy Gerard Endres.
-
-    Support for tar archives was added in Tika 0.2.
-
-   [ZIP archive (application/zip)]
-    Tika uses Java's built-in Zip classes to parse ZIP files.
-
-    Support for ZIP was added in Tika 0.2.
+                       --------------------------
+                       Supported Document Formats
+                       --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+   This page lists all the document formats supported by Apache Tika.
+
+* Microsoft's OLE 2 Compound Document format
+
+   A number of Microsoft applications, most notably the Microsoft Office
+   suite, use the generic OLE 2 Compound Document format as the basis of
+   their document formats. Tika uses {{{http://poi.apache.org/}Apache POI}}
+   to support a number of these formats.
+
+   The OLE2 Compound Document format is designed for use with random access
+   files, and so the input stream passed to a Tika parser needs to be spooled
+   in memory or in a temporary file depending on the size of the document.
+   See {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for an
+   effort to avoid this extra temporary file if the input document already
+   comes from a file.
+
+   In addition to the shared base format there's also a shared set of
+   metadata in typical OLE2 documents. Tika uses the
+   {{{http://poi.apache.org/hpsf/}HPSF library}} from POI to parse these
+   property sets and exposes them as the following document metadata:
+
+      * <<<TITLE>>> Title
+
+      * <<<SUBJECT>>> Subject
+
+      * <<<AUTHOR>>> Author
+
+      * <<<KEYWORDS>>> Keywords
+
+      * <<<COMMENTS>>> Comments
+
+      * <<<TEMPLATE>>> Template
+
+      * <<<LAST_AUTHOR>>> Last Saved By
+
+      * <<<REVISION_NUMBER>>> Revision Number
+
+      * <<<LAST_PRINTED>>> Last Printed
+
+      * <<<LAST_SAVED>>> Last Saved Time/Date
+
+      * <<<LAST_SAVED>>> Last Saved Time/Date
+
+      * <<<PAGE_COUNT>>> Number of Pages
+
+      * <<<WORD_COUNT>>> Number of Words
+
+      * <<<CHARACTER_COUNT>>> Number of Characters
+
+      * <<<APPLICATION_NAME>>> Name of Creating Application
+
+   Note that in practice the metadata in many documents is either missing,
+   incomplete or even incorrect, so a client application should not rely
+   too much on this information.
+
+   Support for the new Office Open XML format used by Microsoft Office
+   version 2007 is pending a POI upgrade. Current status is recorded in
+   {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}.
+
+   The generic OLE2 Compound Document format is automatically detected using
+   a magic number, and further parsing can automatically determine the more
+   specific document format. Tika also knows a number of common glob patterns
+   like <<<*.doc>>> and <<<*.ppt>>> for these formats.
+
+   The supported OLE 2 Compound Document formats are:
+
+   [Microsoft Excel (application/vnd.ms-excel)]
+    Excel spreadsheet support is available in all versions of Tika and is
+    based on the {{{http://poi.apache.org/hssf/}HSSF library}} from POI.
+
+    The Excel parser in Tika uses the
+    {{{http://poi.apache.org/hssf/how-to.html#event_api}HSSF event API}} and
+    is able to extract much of the document structure, including all
+    (non-empty) worksheets and their table structures. Formula results are
+    extracted as stored in the Excel file, and cell links are exposed as
+    XHTML links. These features were added in Tika version 0.2.
+
+    Cell comments and formatting are currently not supported. See
+    {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
+    {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
+    respective issues.
+
+   [Microsoft Word (application/msword)]
+    Word document support is available in all versions of Tika and is based
+    on the {{{http://poi.apache.org/hwpf/}HWPF library}} from POI.
+
+    The Word parser uses the
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
+    class from HWPF to extract document content as a sequence of paragraphs.
+
+   [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
+    PowerPoint presentation support is available in all versions of Tika and
+    is based on the {{{http://poi.apache.org/hslf/}HSLF library}} from POI.
+
+    The PowerPoint parser uses the
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
+    class from HSLF to extract presentation content as a single paragraph.
+
+   [Microsoft Visio (application/vnd.visio)]
+    Visio diagram support was added in Tika version 0.2 and is based on the
+    {{{http://poi.apache.org/hdgf/}HDGF library}} from POI.
+
+    The Visio parser uses the
+    {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioTextExtractor}}
+    class from HDGF to extract diagram content as a sequence of paragraphs.
+
+   [Microsoft Outlook (application/vnd.ms-outlook)]
+    Outlook message support was added in Tika version 0.2 and is based on the
+    {{{http://poi.apache.org/hsmf/}HSMF library}} from POI.
+
+    The Outlook parser extracts the subject of the message and the From,
+    To, Cc, and Bcc addresses (formatted for display) along with the body
+    text of text/plain messages. The <<<AUTHOR>>>, <<<TITLE>>> and
+    <<<SUBJECT>>> metadata properties are set explicitly, overriding
+    potential generic document metadata retrieved from OLE2 property sets.
+
+* Compression formats
+
+   General purpose compression formats are used to reduce the size of
+   any kind of document. Tika uses a parsing pipeline to support general
+   purpose compression: in the first stage the compressed stream is decompressed
+   and the resulting decompressed stream is passed on to a second parsing
+   stage where it will be processed as if the document had never been
+   compressed.
+
+   Tika contains magic numbers and glob patterns for auto-detecting all
+   supported compression formats. The glob patterns of compression formats
+   are also used to determine the name of the original uncompressed document.
+   If a client application has supplied a <<<RESOURCE_NAME_KEY>>> metadata
+   property that matches such a glob pattern, then the decompressing first
+   parsing stage will replace the <<<RESOURCE_NAME_KEY>>> metadata property
+   with the deduced original document name before passing control to the
+   second parsing stage.
+
+   Note that apart from the special handling of the <<<RESOURCE_NAME_KEY>>>
+   property, no document metadata is passed to or from the second parsing
+   stage. Only the text content extracted by the second stage parser is
+   returned to the client application.
+
+   The supported compression formats are:
+
+   [gzip compression (application/x-gzip)]
+    {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
+    Tika version 0.2 and is based on the
+    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
+    class in the Java 5 class library.
+
+    The known gzip glob patterns are <<<*.tgz>>>, <<<*.gz>>> and <<<*-gz>>>,
+    and they will respectively be replaced with <<<*.tar>>>, <<<*>>> and
+    <<<*>>> as described above.
+
+   [bzip2 compression (application/x-bzip)]
+    {{{http://en.wikipedia.org/wiki/Bzip2}Bzip2}} support was added in
+    Tika version 0.2 and is based on bzip2 parsing code from
+    {{{http://ant.apache.org/}Apache Ant}}, which in turn was originally
+    based on work by Keiron Liddle from Aftex Software.
+
+    The known bzip2 glob patterns are <<<*.tbz>>>, <<<*.tbz2>>>, <<<*.bz>>>
+    and <<<*.bz2>>>, and they will respectively be replaced with <<<*.tar>>>,
+    <<<*.tar>>>, <<<*>>> and <<<*>>> as described above.
+
+* Audio formats
+
+   Tika can detect several common audio formats and extract metadata
+   from them. Text extraction is supported for some MIDI-based karaoke
+   formats that contain the lyrics of the encoded audio.
+
+   See {{{https://issues.apache.org/jira/browse/TIKA-94}TIKA-94}} for
+   an effort to integrate speech recognition support into Tika.
+
+   [MP3 Audio (audio/mpeg)]
+    The parsing of {{{http://www.id3.org/ID3v1}ID3v1}} tags from MP3 files
+    was added in Tika version 0.2. If found, the following metadata is
+    extracted and set:
+
+      * <<<TITLE>>> Title
+
+      * <<<SUBJECT>>> Subject
+
+    The above information, as well as the <<<Album>>>, <<<Track>>>,
+    <<<Year>>>, <<<Genre>>> and additional <<<Comment>>> are extracted
+    when set in the file.
+
+   [MIDI audio (audio/midi)]
+    Tika uses the MIDI support in <<<javax.sound.midi>>> to parse MIDI
+    sequence files. Many karaoke file formats are based on MIDI, and
+    contain lyrics as embedded text tracks that Tika knows how to extract.
+
+    Support for MIDI files was added in Tika 0.3.
+
+   [Wave audio (audio/basic)]
+    Tika supports sampled wave audio (.wav files, etc.) using the
+    <<<javax.sound.sampled>>> package. Only sampling metadata is extracted.
+
+    Support for sampled wave audio was added in Tika 0.3. 
+
+* Other supported formats
+
+   [Extensible Markup Language (application/xml)]
+    Tika uses the <<<javax.xml>>> classes to parse Extensible Markup Language files.
+    Support for Extensible Markup Language files was added in Tika 0.1.
+
+   [HyperText Markup Language (text/html)]
+    Tika uses the {{{http://sourceforge.net/projects/nekohtml}CyberNeko}} library to parse HyperText Markup Language files.
+    Support for HyperText Markup Language files was added in Tika 0.1.
+
+   [Images (image/*)]
+    Tika uses the <<<javax.imageio>>> classes to extract metadata
+    from image files.
+
+    Support for Image files was added in Tika 0.2.
+
+   [Java class files]
+    The parsing of Java Class files is based on the asm library and
+    work by Dave Brosius in JCR-1522.
+
+    Support for Java Class files was added in Tika 0.2.
+
+   [Java jar archives]
+    The parsing of Java JAR archives is performed using a combination of
+    the ZIP and Java class file parsers.
+
+    Support for Java JAR archives was added in Tika 0.2.
+
+   [OpenDocument (application/vnd.oasis.opendocument.*)]
+    Tika uses the built-in ZIP and XML features in Java to parse the
+    {{{http://en.wikipedia.org/wiki/OpenDocument}OpenDocument}} document types
+    used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0
+    formats are also supported, though they are currently not auto-detected
+    as reliably as the newer formats.
+
+    Support for the OpenDocument formats was added in Tika 0.3.
+
+   [Plain text (text/plain)]
+    Tika uses the
+    {{{http://www.icu-project.org/}International Components for Unicode}}
+    Java library (ICU4J) to parse plain text. Support for plain text was added
+    in Tika 0.1.
+
+    Extracting text content from plain text files is actually a relatively
+    complex task due to the fact that the character encoding of the text
+    file is often unknown to the parser.
+
+    The text parser in Tika uses the ICU4J
+    {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
+    class to automatically detect the character encoding of any text input.
+    As an added benefit, the ICU4J library is in some cases also able to
+    detect the language in which the text is written.
+
+    The character encoding and language of the plain text document are
+    returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
+    metadata properties. If the (declared) content encoding of a text document
+    is already known to the client application, then it can be supplied as the
+    <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
+    simplify encoding detection.
+
+   [Portable Document Format (application/pdf)]
+    Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
+    Portable Document Format (PDF) documents.
+
+    Support for PDF was added in Tika 0.1.
+
+   [Rich Text Format (application/rtf)]
+    Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
+    documents. Support for RTF was added in Tika 0.1.
+
+    The RTF parser in Tika uses the Swing
+    {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
+    class to extract all text from an RTF document as a single paragraph.
+    Document metadata extraction is currently not supported.
+
+   [tar archive (application/x-tar)]
+    Tika uses an adapted version of the tar parsing code from
+    {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
+    The tar code is originally based on work by Timothy Gerard Endres.
+
+    Support for tar archives was added in Tika 0.2.
+
+   [ZIP archive (application/zip)]
+    Tika uses Java's built-in Zip classes to parse ZIP files.
+
+    Support for ZIP was added in Tika 0.2.

Propchange: tika/site/src/site/apt/0.5/formats.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: tika/site/src/site/apt/0.5/gettingstarted.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/gettingstarted.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/gettingstarted.apt (original)
+++ tika/site/src/site/apt/0.5/gettingstarted.apt Mon Oct 10 22:12:19 2011
@@ -1,241 +1,241 @@
-                     --------------------------------
-                     Getting Started with Apache Tika
-                     --------------------------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements.  See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License.  You may obtain a copy of the License at
-~~
-~~     http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Getting Started with Apache Tika
-
- This document describes how to build Apache Tika from sources and
- how to start using Tika in an application.
-
-Getting and building the sources
-
- To build Tika from sources you first need to either
- {{{../download.html}download}} a source release or
- {{{../source-repository.html}check out}} the latest sources from
- version control.
-
- Once you have the sources, you can build them using the
- {{{http://maven.apache.org/}Maven 2}} build system. Executing the
- following command in the base directory will build the sources
- and install the resulting artifacts in your local Maven repository.
-
----
-mvn install
----
-
- See the Maven documentation for more information about the available
- build options.
-
- Note that you need Java 5 or higher to build Tika.
-
-Build artifacts
-
- Starting with Tika 0.5, the build consists of a number of components
- and produces the following main binaries (x.y stands for the current
- Tika version number):
-
- [tika-core/target/tika-core-x.y.jar]
-  Tika core library. Contains the core interfaces and classes of Tika,
-  but none of the parser implementations. Depends only on Java 5.
-
- [tika-core/target/tika-core-x.y-jdk14.jar]
-  Java 1.4 version of the Tika core library.
-
- [tika-parsers/target/tika-parsers-x.y.jar]
-  Tika parsers. Collection of classes that implement the Tika Parser
-  interface based on various external parser libraries.
-
- [tika-app/target/tika-app-x.y.jar]
-  Tika application. Combines the above libraries and all the external
-  parser libraries into a single runnable jar with a GUI and a command
-  line interface.
-
-Using Tika as a Maven dependency
-
- Since the 0.5 release Tika has been split into components to give you
- more control over which parts of Tika you want to use in your application.
- The core library, tika-core, contains the key interfaces and classes, so
- you'll always want to include a dependency on it:
-
----
-  <dependency>
-    <groupId>org.apache.tika</groupId>
-    <artifactId>tika-core</artifactId>
-    <version>x.y</version>  <!-- 0.5 or higher -->
-  </dependency>
----
-
- This dependency only gives you basic Tika functionality without any of
- the parser libraries. If you want to use Tika to parse documents (instead
- of simply detecting document types, etc.), you also need the tika-parsers
- dependency: 
-
----
-  <dependency>
-    <groupId>org.apache.tika</groupId>
-    <artifactId>tika-parsers</artifactId>
-    <version>x.y</version>  <!-- same version as in tika-core -->
-  </dependency>
----
-
- Note that adding this dependency will introduce a number of
- transitive dependencies to your project. You need to make sure that
- these dependencies won't conflict with your existing project dependencies.
- The listing below shows all the compile-scope dependencies of the
- current Tika parsers release (0.5, November 2009). You can use the
- command "mvn dependency:tree" to check the latest tree of dependencies on any
- one of Tika's core, parsers and app projects.
-
----
-org.apache.tika:tika-parent:pom:0.5
-org.apache.tika:tika-core:bundle:0.5
-\- junit:junit:jar:3.8.1:test
-org.apache.tika:tika-parsers:bundle:0.5
-+- org.apache.tika:tika-core:jar:0.5:compile
-+- org.apache.commons:commons-compress:jar:1.0:compile
-+- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
-|  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
-|  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
-+- org.apache.poi:poi:jar:3.5-FINAL:compile
-+- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:compile
-+- org.apache.poi:poi-ooxml:jar:3.5-FINAL:compile
-|  +- org.apache.poi:ooxml-schemas:jar:1.0:compile
-|  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
-|  \- dom4j:dom4j:jar:1.6.1:compile
-|     \- xml-apis:xml-apis:jar:1.0.b2:compile
-+- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
-+- commons-logging:commons-logging:jar:1.1.1:compile
-+- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
-+- asm:asm:jar:3.1:compile
-+- log4j:log4j:jar:1.2.14:compile
-+- junit:junit:jar:3.8.1:test
-+- org.mockito:mockito-core:jar:1.7:test
-|  +- org.hamcrest:hamcrest-core:jar:1.1:test
-|  \- org.objenesis:objenesis:jar:1.0:test
-\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
-org.apache.tika:tika-app:bundle:0.5
-\- org.apache.tika:tika-parsers:jar:0.5:provided
-   +- org.apache.tika:tika-core:jar:0.5:provided
-   +- org.apache.commons:commons-compress:jar:1.0:provided
-   +- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:provided
-   |  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:provided
-   |  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:provided
-   +- org.apache.poi:poi:jar:3.5-FINAL:provided
-   +- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:provided
-   +- org.apache.poi:poi-ooxml:jar:3.5-FINAL:provided
-   |  +- org.apache.poi:ooxml-schemas:jar:1.0:provided
-   |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:provided
-   |  \- dom4j:dom4j:jar:1.6.1:provided
-   |     \- xml-apis:xml-apis:jar:1.0.b2:provided
-   +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:provided
-   +- commons-logging:commons-logging:jar:1.1.1:provided
-   +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:provided
-   +- asm:asm:jar:3.1:provided
-   +- log4j:log4j:jar:1.2.14:provided
-   \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided
----
-
-Using Tika in an Ant project
-
- If you don't use a dependency manager tool like
- {{{http://ant.apache.org/ivy/}Apache Ivy}}, you can use Tika in your
- application by including the Tika jar files and dependencies individually.
-
----
-<classpath>
-  ... <!-- your other classpath entries -->
-  <pathelement location="path/to/tika-core-0.5.jar"/>
-  <pathelement location="path/to/tika-parsers-0.5.jar"/>
-  <pathelement location="path/to/commons-logging-1.1.1.jar"/>
-  <pathelement location="path/to/commons-compress-1.0.jar"/>
-  <pathelement location="path/to/pdfbox-0.7.3.jar"/>
-  <pathelement location="path/to/fontbox-0.1.0.jar"/>
-  <pathelement location="path/to/jempbox-0.2.0.jar"/>
-  <pathelement location="path/to/bcmail-jdk14-136.jar"/>
-  <pathelement location="path/to/bcprov-jdk14-136.jar"/>
-  <pathelement location="path/to/poi-3.5-beta6.jar"/>
-  <pathelement location="path/to/poi-scratchpad-3.5-beta6.jar"/>
-  <pathelement location="path/to/poi-ooxml-3.5-beta6.jar"/>
-  <pathelement location="path/to/ooxml-schemas-1.0.jar"/>
-  <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
-  <pathelement location="path/to/dom4j-1.6.1.jar"/>
-  <pathelement location="path/to/nekohtml-1.9.9.jar"/>
-  <pathelement location="path/to/xercesImpl-2.8.1.jar"/>
-  <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
-  <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.jar"/>
-  <pathelement location="path/to/asm-3.1.jar"/>
-  <pathelement location="path/to/log4j-1.2.14.jar"/>
-</classpath>
----
-
- An easy way to gather all these libraries is to run
- "mvn dependency:copy-dependencies" in the Tika source directory.
- This will copy all Tika dependencies to the <<<target/dependency>>>
- directory.
-
- Alternatively you can simply drop the entire tika-app jar to your
- classpath to get all of the above dependencies in a single archive.
-
-Using Tika as a command line utility
-
- The Tika application jar (tika-app-x.y.jar) can be used as a command
- line utility for extracting text content and metadata from all sorts of
- files. This runnable jar contains all the dependencies it needs, so
- you don't need to worry about classpath settings to run it.
-
- The usage instructions are shown below.
-
----
-usage: java -jar tika-app-x.y.jar [option] [file]
-
-Options:
-    -? or --help       Print this usage message
-    -v or --verbose    Print debug level messages
-    -g or --gui        Start the Apache Tika GUI
-    -x or --xml        Output XHTML content (default)
-    -h or --html       Output HTML content
-    -t or --text       Output plain text content
-    -m or --metadata   Output only metadata
-
-Description:
-    Apache Tika will parse the file(s) specified on the
-    command line and output the extracted text content
-    or metadata to standard output.
-
-    Instead of a file name you can also specify the URL
-    of a document to be parsed.
-
-    If no file name or URL is specified (or the special
-    name "-" is used), then the standard input stream
-    is parsed.
-
-    Use the "--gui" (or "-g") option to start
-    the Apache Tika GUI. You can drag and drop files
-    from a normal file explorer to the GUI window to
-    extract text content and metadata from the files.
----
-
- You can also use the jar as a component in a Unix pipeline or
- as an external tool in many scripting languages.
-
----
-# Check if an Internet resource contains a specific keyword
-curl http://.../document.doc \
-  | java -jar tika-app-x.y.jar --text \
-  | grep -q keyword
----
+                     --------------------------------
+                     Getting Started with Apache Tika
+                     --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../source-repository.html}check out}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 5 or higher to build Tika.
+
+Build artifacts
+
+ Starting with Tika 0.5, the build consists of a number of components
+ and produces the following main binaries (x.y stands for the current
+ Tika version number):
+
+ [tika-core/target/tika-core-x.y.jar]
+  Tika core library. Contains the core interfaces and classes of Tika,
+  but none of the parser implementations. Depends only on Java 5.
+
+ [tika-core/target/tika-core-x.y-jdk14.jar]
+  Java 1.4 version of the Tika core library.
+
+ [tika-parsers/target/tika-parsers-x.y.jar]
+  Tika parsers. Collection of classes that implement the Tika Parser
+  interface based on various external parser libraries.
+
+ [tika-app/target/tika-app-x.y.jar]
+  Tika application. Combines the above libraries and all the external
+  parser libraries into a single runnable jar with a GUI and a command
+  line interface.
+
+Using Tika as a Maven dependency
+
+ Since the 0.5 release Tika has been split into components to give you
+ more control over which parts of Tika you want to use in your application.
+ The core library, tika-core, contains the key interfaces and classes, so
+ you'll always want to include a dependency on it:
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-core</artifactId>
+    <version>x.y</version>  <!-- 0.5 or higher -->
+  </dependency>
+---
+
+ This dependency only gives you basic Tika functionality without any of
+ the parser libraries. If you want to use Tika to parse documents (instead
+ of simply detecting document types, etc.), you also need the tika-parsers
+ dependency: 
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-parsers</artifactId>
+    <version>x.y</version>  <!-- same version as in tika-core -->
+  </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project. You need to make sure that
+ these dependencies won't conflict with your existing project dependencies.
+ The listing below shows all the compile-scope dependencies of the
+ current Tika parsers release (0.5, November 2009). You can use the
+ command "mvn dependency:tree" to print the current dependency tree of any
+ of Tika's core, parsers and app projects.
+
+---
+org.apache.tika:tika-parent:pom:0.5
+org.apache.tika:tika-core:bundle:0.5
+\- junit:junit:jar:3.8.1:test
+org.apache.tika:tika-parsers:bundle:0.5
++- org.apache.tika:tika-core:jar:0.5:compile
++- org.apache.commons:commons-compress:jar:1.0:compile
++- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:compile
+|  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:compile
+|  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:compile
++- org.apache.poi:poi:jar:3.5-FINAL:compile
++- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:compile
++- org.apache.poi:poi-ooxml:jar:3.5-FINAL:compile
+|  +- org.apache.poi:ooxml-schemas:jar:1.0:compile
+|  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
+|  \- dom4j:dom4j:jar:1.6.1:compile
+|     \- xml-apis:xml-apis:jar:1.0.b2:compile
++- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
++- commons-logging:commons-logging:jar:1.1.1:compile
++- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:compile
++- asm:asm:jar:3.1:compile
++- log4j:log4j:jar:1.2.14:compile
++- junit:junit:jar:3.8.1:test
++- org.mockito:mockito-core:jar:1.7:test
+|  +- org.hamcrest:hamcrest-core:jar:1.1:test
+|  \- org.objenesis:objenesis:jar:1.0:test
+\- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
+org.apache.tika:tika-app:bundle:0.5
+\- org.apache.tika:tika-parsers:jar:0.5:provided
+   +- org.apache.tika:tika-core:jar:0.5:provided
+   +- org.apache.commons:commons-compress:jar:1.0:provided
+   +- org.apache.pdfbox:pdfbox:jar:0.8.0-incubating:provided
+   |  +- org.apache.pdfbox:fontbox:jar:0.8.0-incubator:provided
+   |  \- org.apache.pdfbox:jempbox:jar:0.8.0-incubator:provided
+   +- org.apache.poi:poi:jar:3.5-FINAL:provided
+   +- org.apache.poi:poi-scratchpad:jar:3.5-FINAL:provided
+   +- org.apache.poi:poi-ooxml:jar:3.5-FINAL:provided
+   |  +- org.apache.poi:ooxml-schemas:jar:1.0:provided
+   |  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:provided
+   |  \- dom4j:dom4j:jar:1.6.1:provided
+   |     \- xml-apis:xml-apis:jar:1.0.b2:provided
+   +- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:provided
+   +- commons-logging:commons-logging:jar:1.1.1:provided
+   +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:provided
+   +- asm:asm:jar:3.1:provided
+   +- log4j:log4j:jar:1.2.14:provided
+   \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided
+---
+
+Using Tika in an Ant project
+
+ Unless you use a dependency management tool like
+ {{{http://ant.apache.org/ivy/}Apache Ivy}}, you can use Tika in your
+ application by including the Tika jar files and their dependencies individually.
+
+---
+<classpath>
+  ... <!-- your other classpath entries -->
+  <pathelement location="path/to/tika-core-0.5.jar"/>
+  <pathelement location="path/to/tika-parsers-0.5.jar"/>
+  <pathelement location="path/to/commons-logging-1.1.1.jar"/>
+  <pathelement location="path/to/commons-compress-1.0.jar"/>
+  <pathelement location="path/to/pdfbox-0.7.3.jar"/>
+  <pathelement location="path/to/fontbox-0.1.0.jar"/>
+  <pathelement location="path/to/jempbox-0.2.0.jar"/>
+  <pathelement location="path/to/bcmail-jdk14-136.jar"/>
+  <pathelement location="path/to/bcprov-jdk14-136.jar"/>
+  <pathelement location="path/to/poi-3.5-beta6.jar"/>
+  <pathelement location="path/to/poi-scratchpad-3.5-beta6.jar"/>
+  <pathelement location="path/to/poi-ooxml-3.5-beta6.jar"/>
+  <pathelement location="path/to/ooxml-schemas-1.0.jar"/>
+  <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
+  <pathelement location="path/to/dom4j-1.6.1.jar"/>
+  <pathelement location="path/to/nekohtml-1.9.9.jar"/>
+  <pathelement location="path/to/xercesImpl-2.8.1.jar"/>
+  <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
+  <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.jar"/>
+  <pathelement location="path/to/asm-3.1.jar"/>
+  <pathelement location="path/to/log4j-1.2.14.jar"/>
+</classpath>
+---
+
+ An easy way to gather all these libraries is to run
+ "mvn dependency:copy-dependencies" in the Tika source directory.
+ By default this copies all Tika dependencies to the
+ <<<target/dependency>>> directory of each project.
+
+ Alternatively you can simply add the entire tika-app jar to your
+ classpath to get all of the above dependencies in a single archive.
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-x.y.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app-x.y.jar [option] [file]
+
+Options:
+    -? or --help       Print this usage message
+    -v or --verbose    Print debug level messages
+    -g or --gui        Start the Apache Tika GUI
+    -x or --xml        Output XHTML content (default)
+    -h or --html       Output HTML content
+    -t or --text       Output plain text content
+    -m or --metadata   Output only metadata
+
+Description:
+    Apache Tika will parse the file(s) specified on the
+    command line and output the extracted text content
+    or metadata to standard output.
+
+    Instead of a file name you can also specify the URL
+    of a document to be parsed.
+
+    If no file name or URL is specified (or the special
+    name "-" is used), then the standard input stream
+    is parsed.
+
+    Use the "--gui" (or "-g") option to start
+    the Apache Tika GUI. You can drag and drop files
+    from a normal file explorer to the GUI window to
+    extract text content and metadata from the files.
+---
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+  | java -jar tika-app-x.y.jar --text \
+  | grep -q keyword
+---
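The "external tool" usage described at the end of gettingstarted.apt can also be sketched from Java itself. The following is a minimal illustration only, assuming Java 7 or later, a `java` executable on the PATH, and a hypothetical helper class named `TikaCli`; the option names are the ones listed in the usage message in the diff, and the jar path is whatever the caller supplies. Reading the output with the platform default charset is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: runs the tika-app jar as a child process.
final class TikaCli {
    // Output options copied from the usage message of the command line utility.
    private static final List<String> OUTPUT_OPTIONS =
            Arrays.asList("--xml", "--html", "--text", "--metadata");

    /** Builds the argument list for "java -jar <jar> <option> <target>". */
    static List<String> buildCommand(String jarPath, String option, String target) {
        if (!OUTPUT_OPTIONS.contains(option)) {
            throw new IllegalArgumentException("Unsupported option: " + option);
        }
        return Arrays.asList("java", "-jar", jarPath, option, target);
    }

    /** Runs the command and returns everything it wrote to standard output. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Let error messages from the child process go straight to our stderr.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        StringBuilder output = new StringBuilder();
        // Assumption: the output is decoded with the platform default charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return output.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: TikaCli <tika-app jar> <file or URL> <keyword>");
            return;
        }
        // Same idea as the curl | java | grep pipeline: extract text, look for a keyword.
        String text = run(buildCommand(args[0], "--text", args[1]));
        System.out.println(text.contains(args[2]) ? "found" : "not found");
    }
}
```

Separating `buildCommand` from `run` keeps the argument handling checkable without the jar being present; only `run` needs a real tika-app jar.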

Propchange: tika/site/src/site/apt/0.5/gettingstarted.apt
------------------------------------------------------------------------------
    svn:eol-style = native

Modified: tika/site/src/site/apt/0.5/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/0.5/index.apt?rev=1181271&r1=1181270&r2=1181271&view=diff
==============================================================================
--- tika/site/src/site/apt/0.5/index.apt (original)
+++ tika/site/src/site/apt/0.5/index.apt Mon Oct 10 22:12:19 2011
@@ -1,100 +1,100 @@
-                       ---------------
-                       Apache Tika 0.5
-                       ---------------
-
-~~ Licensed to the Apache Software Foundation (ASF) under one or more
-~~ contributor license agreements.  See the NOTICE file distributed with
-~~ this work for additional information regarding copyright ownership.
-~~ The ASF licenses this file to You under the Apache License, Version 2.0
-~~ (the "License"); you may not use this file except in compliance with
-~~ the License.  You may obtain a copy of the License at
-~~
-~~     http://www.apache.org/licenses/LICENSE-2.0
-~~
-~~ Unless required by applicable law or agreed to in writing, software
-~~ distributed under the License is distributed on an "AS IS" BASIS,
-~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-~~ See the License for the specific language governing permissions and
-~~ limitations under the License.
-
-Apache Tika 0.5
-
-   The most notable changes in Tika 0.5 over the previous release are:
-
-      * Improved RDF/OWL mime detection using both MIME magic as well as
-        pattern matching.
-        ({{{https://issues.apache.org/jira/browse/TIKA-309}TIKA-309}})
-
-      * An org.apache.tika.Tika facade class has been added to simplify
-        common text extraction and type detection use cases.
-        ({{{https://issues.apache.org/jira/browse/TIKA-269}TIKA-269}})
-
-      * A new parse context argument was added to the Parser.parse() method.
-        This context map can be used to pass things like a delegate parser
-        or other settings to the parsing process. The previous parse() method
-        signature has been deprecated and will be removed in Tika 1.0.
-        ({{{https://issues.apache.org/jira/browse/TIKA-275}TIKA-275}})
-
-      * A simple ngram-based language detection mechanism has been added
-        along with predefined language profiles for 18 languages.
-        ({{{https://issues.apache.org/jira/browse/TIKA-209}TIKA-209}})
-
-      * The media type registry in Tika was synchronized with the MIME type
-        configuration in the Apache HTTP Server. Tika now knows about 1274
-        different media types and can detect 672 of those using 927 file
-        extension and 280 magic byte patterns.
-        ({{{https://issues.apache.org/jira/browse/TIKA-285}TIKA-285}})
-
-      * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing
-        PDF documents. This version is notably better than the 0.7.3 release
-        used earlier.
-        ({{{https://issues.apache.org/jira/browse/TIKA-158}TIKA-158}})
-
-   The following people have contributed to Tika 0.5 by submitting or
-   commenting on the issues resolved in this release:
-
-      * Alex Baranov
-
-      * Bart Hanssens
-
-      * Benson Margulies
-
-      * Chris A. Mattmann
-
-      * Daan de Wit
-
-      * Erik Hetzner
-
-      * Frank Hellwig
-
-      * Jeff Cadow
-
-      * Joachim Zittmayr
-
-      * Jukka Zitting 
-
-      * Julien Nioche
-
-      * Ken Krugler
-
-      * Maxim Valyanskiy
-
-      * MRIT64
-
-      * Paul Borgermans
-
-      * Piotr B.
-
-      * Robert Newson
-
-      * Sascha Szott
-
-      * Ted Dunning
-
-      * Thilo Goetz
-
-      * Uwe Schindler
-
-      * Yuan-Fang Li
-
-   See {{http://tinyurl.com/yl9prwp}} for more details on these contributions.
+                       ---------------
+                       Apache Tika 0.5
+                       ---------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika 0.5
+
+   The most notable changes in Tika 0.5 over the previous release are:
+
+      * Improved RDF/OWL mime detection using both MIME magic as well as
+        pattern matching.
+        ({{{https://issues.apache.org/jira/browse/TIKA-309}TIKA-309}})
+
+      * An org.apache.tika.Tika facade class has been added to simplify
+        common text extraction and type detection use cases.
+        ({{{https://issues.apache.org/jira/browse/TIKA-269}TIKA-269}})
+
+      * A new parse context argument was added to the Parser.parse() method.
+        This context map can be used to pass things like a delegate parser
+        or other settings to the parsing process. The previous parse() method
+        signature has been deprecated and will be removed in Tika 1.0.
+        ({{{https://issues.apache.org/jira/browse/TIKA-275}TIKA-275}})
+
+      * A simple ngram-based language detection mechanism has been added
+        along with predefined language profiles for 18 languages.
+        ({{{https://issues.apache.org/jira/browse/TIKA-209}TIKA-209}})
+
+      * The media type registry in Tika was synchronized with the MIME type
+        configuration in the Apache HTTP Server. Tika now knows about 1274
+        different media types and can detect 672 of those using 927 file
+        extension and 280 magic byte patterns.
+        ({{{https://issues.apache.org/jira/browse/TIKA-285}TIKA-285}})
+
+      * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing
+        PDF documents. This version is notably better than the 0.7.3 release
+        used earlier.
+        ({{{https://issues.apache.org/jira/browse/TIKA-158}TIKA-158}})
+
+   The following people have contributed to Tika 0.5 by submitting or
+   commenting on the issues resolved in this release:
+
+      * Alex Baranov
+
+      * Bart Hanssens
+
+      * Benson Margulies
+
+      * Chris A. Mattmann
+
+      * Daan de Wit
+
+      * Erik Hetzner
+
+      * Frank Hellwig
+
+      * Jeff Cadow
+
+      * Joachim Zittmayr
+
+      * Jukka Zitting 
+
+      * Julien Nioche
+
+      * Ken Krugler
+
+      * Maxim Valyanskiy
+
+      * MRIT64
+
+      * Paul Borgermans
+
+      * Piotr B.
+
+      * Robert Newson
+
+      * Sascha Szott
+
+      * Ted Dunning
+
+      * Thilo Goetz
+
+      * Uwe Schindler
+
+      * Yuan-Fang Li
+
+   See {{http://tinyurl.com/yl9prwp}} for more details on these contributions.
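The n-gram based language detection mentioned in the release notes (TIKA-209) can be illustrated with a generic sketch. This is not Tika's implementation: the class name `NgramSketch`, the use of character trigrams, and the cosine similarity measure are all assumptions chosen for illustration. The idea is only that a text is reduced to a profile of n-gram counts and compared against reference profiles.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a generic character-trigram profile, not Tika's own code.
final class NgramSketch {
    /** Counts overlapping character trigrams in the lower-cased text. */
    static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        String normalized = text.toLowerCase();
        for (int i = 0; i + 3 <= normalized.length(); i++) {
            String gram = normalized.substring(i, i + 3);
            Integer old = counts.get(gram);
            counts.put(gram, old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Cosine similarity of two profiles; 0 if either profile is empty. */
    static double similarity(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (int v : b.values()) {
            normB += v * v;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Tiny made-up reference texts; real profiles would be built from large corpora.
        Map<String, Integer> english = profile("the dog and the cat sleep in the house");
        Map<String, Integer> german = profile("der hund und die katze schlafen in dem haus");
        Map<String, Integer> query = profile("the cat and the dog");
        System.out.println("english: " + similarity(query, english));
        System.out.println("german:  " + similarity(query, german));
    }
}
```

A detector built on this idea would pick the reference profile with the highest score; how Tika weights and compares its 18 predefined profiles is not described in the release notes.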

Propchange: tika/site/src/site/apt/0.5/index.apt
------------------------------------------------------------------------------
    svn:eol-style = native