Posted to commits@tika.apache.org by dm...@apache.org on 2014/03/27 10:46:54 UTC
svn commit: r1582236 [2/14] - in /tika/site: publish/ publish/0.10/ publish/0.5/ publish/0.6/ publish/0.7/ publish/0.8/ publish/0.9/ publish/1.0/ publish/1.1/ publish/1.2/ publish/1.3/ publish/1.4/ publish/1.5/ src/site/

Modified: tika/site/publish/0.10/parser_guide.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.10/parser_guide.html?rev=1582236&r1=1582235&r2=1582236&view=diff
==============================================================================
--- tika/site/publish/0.10/parser_guide.html (original)
+++ tika/site/publish/0.10/parser_guide.html Thu Mar 27 09:46:52 2014
@@ -45,7 +45,7 @@
           }
         }
         if (provider == "lucid") {
-          form.action = "http://search.lucidimagination.com/p:tika";
+          form.action = "http://find.searchhub.org/p:tika";
         } else if (provider == "sl") {
           form.action = "http://search-lucene.com/tika";
         }
@@ -84,9 +84,31 @@
                 width="387" height="100"/></a>
       </div>
       <div id="content">
-        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Get Tika parsing up and running in 5 minutes<a name="Get_Tika_parsing_up_and_running_in_5_minutes"></
 a></h2><p>This page is a quick start guide showing how to add a new parser to Apache Tika. Following the simple steps listed below your new parser can be running in only 5 minutes.</p><ul><li><a href="#Get_Tika_parsing_up_and_running_in_5_minutes">Get Tika parsing up and running in 5 minutes</a><ul><li><a href="#Getting_Started">Getting Started</a></li><li><a href="#Add_your_MIME-Type">Add your MIME-Type</a></li><li><a href="#Create_your_Parser_class">Create your Parser class</a></li><li><a href="#List_the_new_parser">List the new parser</a></li></ul></li></ul><div class="section"><h3><a name="Getting_Started">Getting Started</a></h3><p>The <a href="./gettingstarted.html">Getting Started</a> document describes how to build Apache Tika from sources and how to start using Tika in an application. Pay close attention and follow the instructions in the &quot;Getting and building the sources&quot; section.</p></div><div class="section"><h3><a name="Add_your_MIME-Type">Add your MIME-Type</
 a></h3><p>You first need to modify <a class="externalLink" href="http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml">tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml</a> in order to Tika can map the file extension with its MIME-Type. You should add something like this:</p><div><pre> &lt;mime-type type=&quot;application/hello&quot;&gt;
+        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section">
+<h2>Get Tika parsing up and running in 5 minutes<a name="Get_Tika_parsing_up_and_running_in_5_minutes"></a></h2>
+<p>This page is a quick start guide showing how to add a new parser to Apache Tika. Following the simple steps listed below your new parser can be running in only 5 minutes.</p>
+<ul>
+<li><a href="#Get_Tika_parsing_up_and_running_in_5_minutes">Get Tika parsing up and running in 5 minutes</a>
+<ul>
+<li><a href="#Getting_Started">Getting Started</a></li>
+<li><a href="#Add_your_MIME-Type">Add your MIME-Type</a></li>
+<li><a href="#Create_your_Parser_class">Create your Parser class</a></li>
+<li><a href="#List_the_new_parser">List the new parser</a></li></ul></li></ul>
+<div class="section">
+<h3><a name="Getting_Started">Getting Started</a></h3>
+<p>The <a href="./gettingstarted.html">Getting Started</a> document describes how to build Apache Tika from sources and how to start using Tika in an application. Pay close attention and follow the instructions in the &quot;Getting and building the sources&quot; section.</p></div>
+<div class="section">
+<h3><a name="Add_your_MIME-Type">Add your MIME-Type</a></h3>
+<p>You first need to modify <a class="externalLink" href="http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml">tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml</a> so that Tika can map the file extension to its MIME-Type. You should add something like this:</p>
+<div>
+<pre> &lt;mime-type type=&quot;application/hello&quot;&gt;
         &lt;glob pattern=&quot;*.hi&quot;/&gt;
- &lt;/mime-type&gt;</pre></div></div><div class="section"><h3><a name="Create_your_Parser_class">Create your Parser class</a></h3><p>Now, you need to create your new parser. This is a class that must implement the Parser interface offered by Tika. A very simple Tika Parser looks like this:</p><div><pre>/*
+ &lt;/mime-type&gt;</pre></div></div>
+<div class="section">
+<h3><a name="Create_your_Parser_class">Create your Parser class</a></h3>
+<p>Now, you need to create your new parser. This is a class that must implement the Parser interface offered by Tika. A very simple Tika Parser looks like this:</p>
+<div>
+<pre>/*
  * Licensed to the Apache Software Foundation (ASF) under one or more
  * contributor license agreements.  See the NOTICE file distributed with
  * this work for additional information regarding copyright ownership.
@@ -150,7 +172,13 @@ public class HelloParser implements Pars
                         throws IOException, SAXException, TikaException {
                 parse(stream, handler, metadata, new ParseContext());
         }
-}</pre></div><p>Pay special attention to the definition of the SUPPORTED_TYPES static class field in the parser class that defines what MIME-Types it supports. </p><p>Is in the &quot;parse&quot; method where you will do all your work. This is, extract the information of the resource and then set the metadata.</p></div><div class="section"><h3><a name="List_the_new_parser">List the new parser</a></h3><p>Finally, you should explicitly tell the AutoDetectParser to include your new parser. This step is only needed if you want to use the AutoDetectParser functionality. If you figure out the correct parser in a different way, it isn't needed. </p><p>List your new parser in: <a class="externalLink" href="http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser">tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser</a></p></div></div>
+}</pre></div>
+<p>Pay special attention to the definition of the SUPPORTED_TYPES static class field in the parser class that defines what MIME-Types it supports. </p>
+<p>The &quot;parse&quot; method is where you will do all your work: extract the information from the resource and then set the metadata.</p></div>
+<div class="section">
+<h3><a name="List_the_new_parser">List the new parser</a></h3>
+<p>Finally, you should explicitly tell the AutoDetectParser to include your new parser. This step is only needed if you want to use the AutoDetectParser functionality. If you figure out the correct parser in a different way, it isn't needed. </p>
+<p>List your new parser in: <a class="externalLink" href="http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser">tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser</a></p></div></div>
       </div>
       <div id="sidebar">
         <div id="navigation">
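
The "Add your MIME-Type" step in parser_guide.html above registers a glob pattern (*.hi) for a MIME type (application/hello). As a rough illustration of what such a mapping does, here is a minimal JDK-only sketch. GlobMimeRegistry, addGlob and detect are hypothetical names invented for this example; this is not Tika's MimeTypes implementation, and it assumes only simple "*.ext" patterns and a generic application/octet-stream fallback.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for a glob-based MIME registry; NOT a Tika class.
class GlobMimeRegistry {
    // Maps a lower-case file suffix such as ".hi" to a MIME type,
    // in registration order.
    private final Map<String, String> suffixToType = new LinkedHashMap<>();

    // Registers a pattern of the form "*.ext".
    // Other glob shapes are rejected in this sketch.
    void addGlob(String pattern, String mimeType) {
        if (!pattern.startsWith("*.") || pattern.length() < 3) {
            throw new IllegalArgumentException("unsupported pattern: " + pattern);
        }
        // Drop the leading '*' so that "*.hi" is stored as ".hi".
        suffixToType.put(pattern.substring(1).toLowerCase(Locale.ROOT), mimeType);
    }

    // Returns the type of the first registered suffix that matches the
    // file name, or a generic fallback (an assumption of this sketch).
    String detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : suffixToType.entrySet()) {
            if (lower.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }
}
```

With addGlob("*.hi", "application/hello") registered, detect("greeting.hi") yields application/hello, while an unregistered name such as notes.txt falls through to the fallback type.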

Modified: tika/site/publish/0.5/documentation.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/documentation.html?rev=1582236&r1=1582235&r2=1582236&view=diff
==============================================================================
--- tika/site/publish/0.5/documentation.html (original)
+++ tika/site/publish/0.5/documentation.html Thu Mar 27 09:46:52 2014
@@ -45,7 +45,7 @@
           }
         }
         if (provider == "lucid") {
-          form.action = "http://search.lucidimagination.com/p:tika";
+          form.action = "http://find.searchhub.org/p:tika";
         } else if (provider == "sl") {
           form.action = "http://search-lucene.com/tika";
         }
@@ -84,27 +84,93 @@
                 width="387" height="100"/></a>
       </div>
       <div id="content">
-        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Apache Tika Documentation<a name="Apache_Tika_Documentation"></a></h2><p>This document describes the 
 key abstractions and usage of Apache Tika.</p></div><div class="section"><h2>The Parser interface<a name="The_Parser_interface"></a></h2><p>The <a href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Parser</a> interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:</p><div><pre>void parse(InputStream stream, ContentHandler handler, Metadata metadata)
-    throws IOException, SAXException, TikaException;</pre></div><p>The <tt>parse</tt> method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The main criteria that lead to this design were:</p><dl><dt>Streamed parsing</dt><dd>The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.</dd><dt>Structured content</dt><dd>A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information for example to better judge the relevance of different parts of the parsed document.</dd><dt>Input metadata</dt><dd>A client application should be able to include metadata like the file name or declared content type with the document to be parsed. T
 he parser implementation can use this information to better guide the parsing process.</dd><dt>Output metadata</dt><dd>A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata like the name of the author that may be useful to client applications.</dd></dl><p>These criteria are reflected in the arguments of the <tt>parse</tt> method.</p></div><div class="section"><h2>Document input stream<a name="Document_input_stream"></a></h2><p>The first argument is an <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html">InputStream</a> for reading the document to be parsed.</p><p>If this document stream can not be read, then parsing stops and the thrown <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html">IOException</a> is passed up to the client application. If the stream can be read but not parsed (for example if the document is
  corrupted), then the parser throws a <a href="./api/org/apache/tika/exception/TikaException.html">TikaException</a>.</p><p>The parser implementation will consume this stream but <i>will not close it</i>. Closing the stream is the responsibility of the client application that opened it in the first place. The recommended pattern for using streams with the <tt>parse</tt> method is:</p><div><pre>InputStream stream = ...;      // open the stream
+        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section">
+<h2>Apache Tika Documentation<a name="Apache_Tika_Documentation"></a></h2>
+<p>This document describes the key abstractions and usage of Apache Tika.</p></div>
+<div class="section">
+<h2>The Parser interface<a name="The_Parser_interface"></a></h2>
+<p>The <a href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Parser</a> interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:</p>
+<div>
+<pre>void parse(InputStream stream, ContentHandler handler, Metadata metadata)
+    throws IOException, SAXException, TikaException;</pre></div>
+<p>The <tt>parse</tt> method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The main criteria that led to this design were:</p>
+<dl>
+<dt>Streamed parsing</dt>
+<dd>The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.</dd>
+<dt>Structured content</dt>
+<dd>A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information for example to better judge the relevance of different parts of the parsed document.</dd>
+<dt>Input metadata</dt>
+<dd>A client application should be able to include metadata like the file name or declared content type with the document to be parsed. The parser implementation can use this information to better guide the parsing process.</dd>
+<dt>Output metadata</dt>
+<dd>A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata like the name of the author that may be useful to client applications.</dd></dl>
+<p>These criteria are reflected in the arguments of the <tt>parse</tt> method.</p></div>
+<div class="section">
+<h2>Document input stream<a name="Document_input_stream"></a></h2>
+<p>The first argument is an <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html">InputStream</a> for reading the document to be parsed.</p>
+<p>If this document stream can not be read, then parsing stops and the thrown <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html">IOException</a> is passed up to the client application. If the stream can be read but not parsed (for example if the document is corrupted), then the parser throws a <a href="./api/org/apache/tika/exception/TikaException.html">TikaException</a>.</p>
+<p>The parser implementation will consume this stream but <i>will not close it</i>. Closing the stream is the responsibility of the client application that opened it in the first place. The recommended pattern for using streams with the <tt>parse</tt> method is:</p>
+<div>
+<pre>InputStream stream = ...;      // open the stream
 try {
     parser.parse(stream, ...); // parse the stream
 } finally {
     stream.close();            // close the stream
-}</pre></div><p>Some document formats like the OLE2 Compound Document Format used by Microsoft Office are best parsed as random access files. In such cases the content of the input stream is automatically spooled to a temporary file that gets removed once parsed. A future version of Tika may make it possible to avoid this extra file if the input document is already a file in the local file system. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for the status of this feature request.</p></div><div class="section"><h2>XHTML SAX events<a name="XHTML_SAX_events"></a></h2><p>The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express structured content of the document and SAX events enable streamed processing. Note that the XHTML format is used here only to convey structural information, not to render the documents for browsing!</p><p>The XHTML SAX events produc
 ed by the parser implementation are sent to a <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html">ContentHandler</a> instance given to the <tt>parse</tt> method. If this the content handler fails to process an event, then parsing stops and the thrown <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html">SAXException</a> is passed up to the client application.</p><p>The overall structure of the generated event stream is (with indenting added for clarity):</p><div><pre>&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
+}</pre></div>
+<p>Some document formats like the OLE2 Compound Document Format used by Microsoft Office are best parsed as random access files. In such cases the content of the input stream is automatically spooled to a temporary file that gets removed once parsed. A future version of Tika may make it possible to avoid this extra file if the input document is already a file in the local file system. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for the status of this feature request.</p></div>
+<div class="section">
+<h2>XHTML SAX events<a name="XHTML_SAX_events"></a></h2>
+<p>The parsed content of the document stream is returned to the client application as a sequence of XHTML SAX events. XHTML is used to express structured content of the document and SAX events enable streamed processing. Note that the XHTML format is used here only to convey structural information, not to render the documents for browsing!</p>
+<p>The XHTML SAX events produced by the parser implementation are sent to a <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html">ContentHandler</a> instance given to the <tt>parse</tt> method. If the content handler fails to process an event, then parsing stops and the thrown <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html">SAXException</a> is passed up to the client application.</p>
+<p>The overall structure of the generated event stream is (with indenting added for clarity):</p>
+<div>
+<pre>&lt;html xmlns=&quot;http://www.w3.org/1999/xhtml&quot;&gt;
   &lt;head&gt;
     &lt;title&gt;...&lt;/title&gt;
   &lt;/head&gt;
   &lt;body&gt;
     ...
   &lt;/body&gt;
-&lt;/html&gt;</pre></div><p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p><p>Dealing with the raw SAX events can be a bit complex, so Apache Tika (since version 0.2) comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p><p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p><div><pre>ContentHandler handler = new BodyContentHandler(System.out);
-parser.parse(System.in, handler, ...);</pre></div><p>Another useful class is <a href="./api/org/apache/tika/parser/ParsingReader.html">ParsingReader</a> that uses a background thread to parse the document and returns the extracted text content as a character stream:</p><div><pre>InputStream stream = ...; // the document to be parsed
+&lt;/html&gt;</pre></div>
+<p>Parser implementations typically use the <a href="./api/org/apache/tika/sax/XHTMLContentHandler.html">XHTMLContentHandler</a> utility class to generate the XHTML output.</p>
+<p>Dealing with the raw SAX events can be a bit complex, so Apache Tika (since version 0.2) comes with a number of utility classes that can be used to process and convert the event stream to other representations.</p>
+<p>For example, the <a href="./api/org/apache/tika/sax/BodyContentHandler.html">BodyContentHandler</a> class can be used to extract just the body part of the XHTML output and feed it either as SAX events to another content handler or as characters to an output stream, a writer, or simply a string. The following code snippet parses a document from the standard input stream and outputs the extracted text content to standard output:</p>
+<div>
+<pre>ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);</pre></div>
+<p>Another useful class is <a href="./api/org/apache/tika/parser/ParsingReader.html">ParsingReader</a> that uses a background thread to parse the document and returns the extracted text content as a character stream:</p>
+<div>
+<pre>InputStream stream = ...; // the document to be parsed
 Reader reader = new ParsingReader(parser, stream, ...);
 try {
     ...;                  // read the document text using the reader
 } finally {
     reader.close();       // the document stream is closed automatically
-}</pre></div></div><div class="section"><h2>Document metadata<a name="Document_metadata"></a></h2><p>The final argument to the <tt>parse</tt> method is used to pass document metadata both in and out of the parser. Document metadata is expressed as an <a href="./api/org/apache/tika/metadata/Metadata.html">Metadata</a> object.</p><p>The following are some of the more interesting metadata properties:</p><dl><dt>Metadata.RESOURCE_NAME_KEY</dt><dd>The name of the file or resource that contains the document.<p>A client application can set this property to allow the parser to use file name heuristics to determine the format of the document.</p><p>The parser implementation may set this property if the file format contains the canonical name of the file (for example the Gzip format has a slot for the file name).</p></dd><dt>Metadata.CONTENT_TYPE</dt><dd>The declared content type of the document.<p>A client application can set this property based on for example a HTTP Content-Type header. The
  declared content type may help the parser to correctly interpret the document.</p><p>The parser implementation sets this property to the content type according to which the document was parsed.</p></dd><dt>Metadata.TITLE</dt><dd>The title of the document.<p>The parser implementation sets this property if the document format contains an explicit title field.</p></dd><dt>Metadata.AUTHOR</dt><dd>The name of the author of the document.<p>The parser implementation sets this property if the document format contains an explicit author field.</p></dd></dl><p>Note that metadata handling is still being discussed by the Tika development team, and it is likely that there will be some (backwards incompatible) changes in metadata handling before Tika 1.0.</p></div><div class="section"><h2>Parser implementations<a name="Parser_implementations"></a></h2><p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. You can also extend Tika w
 ith your own parsers, and of course any contributions to Tika are warmly welcome.</p><p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://www.pdfbox.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p><p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any types of documents. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div>
+}</pre></div></div>
+<div class="section">
+<h2>Document metadata<a name="Document_metadata"></a></h2>
+<p>The final argument to the <tt>parse</tt> method is used to pass document metadata both in and out of the parser. Document metadata is expressed as an <a href="./api/org/apache/tika/metadata/Metadata.html">Metadata</a> object.</p>
+<p>The following are some of the more interesting metadata properties:</p>
+<dl>
+<dt>Metadata.RESOURCE_NAME_KEY</dt>
+<dd>The name of the file or resource that contains the document.
+<p>A client application can set this property to allow the parser to use file name heuristics to determine the format of the document.</p>
+<p>The parser implementation may set this property if the file format contains the canonical name of the file (for example the Gzip format has a slot for the file name).</p></dd>
+<dt>Metadata.CONTENT_TYPE</dt>
+<dd>The declared content type of the document.
+<p>A client application can set this property based on, for example, an HTTP Content-Type header. The declared content type may help the parser to correctly interpret the document.</p>
+<p>The parser implementation sets this property to the content type according to which the document was parsed.</p></dd>
+<dt>Metadata.TITLE</dt>
+<dd>The title of the document.
+<p>The parser implementation sets this property if the document format contains an explicit title field.</p></dd>
+<dt>Metadata.AUTHOR</dt>
+<dd>The name of the author of the document.
+<p>The parser implementation sets this property if the document format contains an explicit author field.</p></dd></dl>
+<p>Note that metadata handling is still being discussed by the Tika development team, and it is likely that there will be some (backwards incompatible) changes in metadata handling before Tika 1.0.</p></div>
+<div class="section">
+<h2>Parser implementations<a name="Parser_implementations"></a></h2>
+<p>Apache Tika comes with a number of parser classes for parsing <a href="./formats.html">various document formats</a>. You can also extend Tika with your own parsers, and of course any contributions to Tika are warmly welcome.</p>
+<p>The goal of Tika is to reuse existing parser libraries like <a class="externalLink" href="http://www.pdfbox.org/">PDFBox</a> or <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> as much as possible, and so most of the parser classes in Tika are adapters to such external libraries.</p>
+<p>Tika also contains some general purpose parser implementations that are not targeted at any specific document formats. The most notable of these is the <a href="./api/org/apache/tika/parser/AutoDetectParser.html">AutoDetectParser</a> class that encapsulates all Tika functionality into a single parser that can handle any type of document. This parser will automatically determine the type of the incoming document based on various heuristics and will then parse the document accordingly.</p></div>
       </div>
       <div id="sidebar">
         <div id="navigation">
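
The stream-handling pattern and the XHTML SAX output described in documentation.html above can be illustrated without Tika on the classpath. The following is a minimal sketch that feeds an XHTML string through the JDK's own SAX parser into a handler that keeps only the text inside the body element. BodyTextCollector and extract are hypothetical names; the class is a simplified stand-in for the idea behind BodyContentHandler, not Tika's implementation, and the JDK XML parser stands in for a Tika Parser producing the events. The caller that opens the stream also closes it, here via try-with-resources, which on Java 7 and later has the same effect as the try/finally pattern shown in the page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler that collects character data found inside <body>.
// A stand-in for illustration only; NOT Tika's BodyContentHandler.
class BodyTextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Nesting depth counted from the <body> element; 0 means "outside body".
    private int bodyDepth = 0;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (bodyDepth > 0 || "body".equals(qName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text in <head> (for example the title) is skipped.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    String text() {
        return text.toString();
    }

    // Opens a stream over the given XHTML, parses it, and closes the stream
    // in the same scope that opened it.
    static String extract(String xhtml)
            throws IOException, SAXException, ParserConfigurationException {
        BodyTextCollector handler = new BodyTextCollector();
        try (InputStream stream =
                new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8))) {
            SAXParserFactory.newInstance().newSAXParser().parse(stream, handler);
        }
        return handler.text();
    }
}
```

Because the handler only sees SAX events, it never needs the whole document in memory at once, which is the streamed-parsing property the page describes. A failure to read the stream surfaces as IOException and a malformed document as SAXException, both passed up to the caller.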

Modified: tika/site/publish/0.5/formats.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/formats.html?rev=1582236&r1=1582235&r2=1582236&view=diff
==============================================================================
--- tika/site/publish/0.5/formats.html (original)
+++ tika/site/publish/0.5/formats.html Thu Mar 27 09:46:52 2014
@@ -45,7 +45,7 @@
           }
         }
         if (provider == "lucid") {
-          form.action = "http://search.lucidimagination.com/p:tika";
+          form.action = "http://find.searchhub.org/p:tika";
         } else if (provider == "sl") {
           form.action = "http://search-lucene.com/tika";
         }
@@ -84,7 +84,117 @@
                 width="387" height="100"/></a>
       </div>
       <div id="content">
-        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Supported Document Formats<a name="Supported_Document_Formats"></a></h2><p>This page lists all the do
 cument formats supported by Apache Tika.</p><div class="section"><h3>Microsoft's OLE 2 Compound Document format<a name="Microsofts_OLE_2_Compound_Document_format"></a></h3><p>A number of Microsoft applications, most notably the Microsoft Office suite, use the generic OLE 2 Compound Document format as the basis of their document formats. Tika uses <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> to support a number of these formats.</p><p>The OLE2 Compound Document format is designed for use with random access files, and so the input stream passed to a Tika parser needs to be spooled in memory or in a temporary file depending on the size of the document. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for an effort to avoid this extra temporary file if the input document already comes from a file.</p><p>In addition to the shared base format there's also a shared sets of metadata in typical OLE2 documents. Tika uses th
 e <a class="externalLink" href="http://poi.apache.org/hpsf/">HPSF library</a> from POI to parse these property sets and exposes them as the following document metadata:</p><ul><li><tt>TITLE</tt> Title</li><li><tt>SUBJECT</tt> Subject</li><li><tt>AUTHOR</tt> Author</li><li><tt>KEYWORDS</tt> Keywords</li><li><tt>COMMENTS</tt> Comments</li><li><tt>TEMPLATE</tt> Template</li><li><tt>LAST_SAVED</tt> Last Saved By</li><li><tt>REVISION_NUMBER</tt> Revision Number</li><li><tt>LAST_PRINTED</tt> Last Printed</li><li><tt>LAST_SAVED</tt> Last Saved Time/Date</li><li><tt>LAST_SAVED</tt> Last Saved Time/Date</li><li><tt>PAGE_COUNT</tt> Number of Pages</li><li><tt>WORD_COUNT</tt> Number of Words</li><li><tt>CHARACTER_COUNT</tt> Number of Characters</li><li><tt>APPLICATION_NAME</tt> Name of Creating Application</li></ul><p>Note that in practice the metadata in many documents is either missing, incomplete or even incorrect, so a client application should not rely too much on this information.</p><p>
 Support for the new Office Open XML format used by Microsoft Office version 2007 is pending for a POI upgrade. Current status is recorded in <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-152">TIKA-152</a>.</p><p>The generic OLE2 Compound Document format is automatically detected using a magic number, and further parsing can automatically determine the more specific document format. Tika also knows a number of common glob patterns like <tt>*.doc</tt> and <tt>*.ppt</tt> for these formats.</p><p>The supported OLE 2 Compound Document formats are:</p><dl><dt>Microsoft Excel (application/vnd.ms-excel)</dt><dd> Excel spreadsheet support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hssf/">HSSF library</a> from POI.<p>The Excel parser in Tika uses the <a class="externalLink" href="http://poi.apache.org/hssf/how-to.html#event_api">HSSF event API</a> and is able to extract much of the document structure,
  including all (non-empty) worksheets and their table structures. Formula results are extracted as stored in the Excel file, and cell links are exposed as XHTML links. These features were added in Tika version 0.2.</p><p>Cell comments and formatting are currently not supported. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-148">TIKA-148</a> and <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-103">TIKA-103</a> for the respective issues.</p></dd><dt>Microsoft Word (application/msword)</dt><dd> Word document support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hwpf/">HWPF library</a> from POI.<p>The Word parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html">WordExtractor</a> class from HWPF to extract document content as a sequence of paragraphs.</p></dd><dt>Microsoft PowerPoint (application/vnd.m
 s-powerpoint)</dt><dd> PowerPoint presentation support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hslf/">HSLF library</a> from POI.<p>The PowerPoint parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html">PowerPointExtractor</a> class from HSLF to extract spreadsheet content as a single paragraph.</p></dd><dt>Microsoft Visio (application/vnd.visio)</dt><dd> Visio diagram support was added in Tika version 0.2 and is based on the <a class="externalLink" href="http://poi.apache.org/hdgf/">HDGF library</a> from POI.<p>The Visio parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html">VisioExtractor</a> class from HDGF to extract diagram content as a sequence of paragraphs.</p></dd><dt>Microsoft Outlook (application/vnd.ms-outlook)</dt><dd> Outlook message support was added i
 n Tika version 0.2 and is based on the <a class="externalLink" href="http://poi.apache.org/hsmf/">HSMF library</a> from POI.<p>The Outlook parser extracts the subject of the message and the From, To, Cc, and Bcc addresses (formatted for display) along with the body text of text/plain messages. The <tt>AUTHOR</tt>, <tt>TITLE</tt> and <tt>SUBJECT</tt> metadata properties are set explicitly, overriding potential generic document metadata retrieved from OLE2 property sets.</p></dd></dl></div><div class="section"><h3>Compression formats<a name="Compression_formats"></a></h3><p>General purpose compression formats are used to reduce the size of any kinds of documents. Tika uses a parsing pipeline to support general purpose compression: in the first stage the compressed stream decompressed and the resulting decompressed stream is passed on to a second parsing stage where it will be processed as if the document had never been compressed.</p><p>Tika contains magic numbers and glob patterns fo
 r auto-detecting all supported compression formats. The glob patterns of compression formats are also used to determine the name of the original uncompressed document. If a client application has supplied a <tt>RESOURCE_NAME_KEY</tt> metadata property that matches such a glob pattern, then the decompressing first parsing stage will replace the <tt>RESOURCE_NAME_KEY</tt> metadata property with the deduced original document name before passing control to the second parsing stage.</p><p>Note that apart from the special handling of the <tt>RESOURCE_NAME_KEY</tt> property, no document metadata is passed to or from the second parsing stage. Only the text content extracted by the second stage parser is returned to the client application.</p><p>The supported compression formats are:</p><dl><dt>gzip compression (application/x-gzip)</dt><dd> <a class="externalLink" href="http://en.wikipedia.org/wiki/Gzip">Gzip</a> support was added in Tika version 0.2 and is based on the <a class="externalLin
 k" href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html">GZIPInputStream</a> class in the Java 5 class library.<p>The known gzip glob patterns are <tt>*.tgz</tt>, <tt>*.gz</tt> and <tt>*-gz</tt>, and they will respectively be replaced with <tt>*.tar</tt>, <tt>*</tt> and <tt>*</tt> as described above.</p></dd><dt>bzip2 compression (application/x-bzip)</dt><dd> <a class="externalLink" href="http://en.wikipedia.org/wiki/Bzip2">Bzip2</a> support was added in Tika version 0.2 and is based on bzip2 parsing code from <a class="externalLink" href="http://ant.apache.org/">Apache Ant</a>, which in turn was originally based on work by Keiron Liddle from Aftex Software.<p>The known bzip2 glob patterns are <tt>*.tbz</tt>, <tt>*.tbz2</tt>, <tt>*.bz</tt> and <tt>*.bz2</tt>, and they will respectively be replaced with <tt>*.tar</tt>, <tt>*.tar</tt>, <tt>*</tt> and <tt>*</tt> as described above.</p></dd></dl></div><div class="section"><h3>Audio formats<a name="Audio_forma
 ts"></a></h3><p>Tika can detect several common audio formats and extract metadata from them. Text extraction is supported for some MIDI-based karaoke formats that contain the lyrics of the encoded audio.</p><p>See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-94">TIKA-94</a> for an effort to integrate speech recognition support to Tika.</p><dl><dt>MP3 Audio (audio/mpeg)</dt><dd> The parsing of <a class="externalLink" href="http://www.id3.org/ID3v1">ID3v1</a> tags from MP3 files was added in Tika version 0.2. If found the following metadata is extracted and set:<ul><li><tt>TITLE</tt> Title</li><li><tt>SUBJECT</tt> Subject</li></ul><p>The above information, as well as the <tt>Album</tt>, <tt>Track</tt>, <tt>Year</tt>, <tt>Genre</tt> and additional <tt>Comment</tt> are extracted when set in the file.</p></dd><dt>MIDI audio (audio/midi)</dt><dd> Tika uses the MIDI support in <tt>javax.audio.midi</tt> to parse MIDI sequence files. Many karaoke file formats are 
 based on MIDI, and contain lyrics as embedded text tracks that Tika knows how to extract.<p>Support for MIDI files was added in Tika 0.3.</p></dd><dt>Wave audio (audio/basic)</dt><dd> Tika supports sampled wave audio (.wav files, etc.) using the <tt>javax.audio.sampled</tt> package. Only sampling metadata is extracted.<p>Support for sampled wave audio was added in Tika 0.3. </p></dd></dl></div><div class="section"><h3>Other supported formats<a name="Other_supported_formats"></a></h3><dl><dt>Extensible Markup Language (application/xml)</dt><dd> Tika uses the <tt>javax.xml</tt> classes to parse Extensible Markup Language files. Support for Extensible Markup Language files was added in Tika 0.1.</dd><dt>HyperText Markup Language (text/html)</dt><dd> Tika uses the <a class="externalLink" href="http://sourceforge.net/projects/nekohtml">CyberNeko</a> library to parse HyperText Markup Language files. Support for HyperText Markup Language files was added in Tika 0.1.</dd><dt>Images (image/*
 )</dt><dd> Tika uses the <tt>javax.imageio</tt> classes to extract metadata from image files.<p>Support for Image files was added in Tika 0.2.</p></dd><dt>Java class files</dt><dd> The parsing of Java Class files is based on the asm library and work by Dave Brosius in JCR-1522.<p>Support for Java Class files was added in Tika 0.2.</p></dd><dt>Java jar archives</dt><dd> The parsing of Java JAR archives is performed using a combination of the ZIP and Java class file parsers.<p>Support for Java JAR archives was added in Tika 0.2.</p></dd><dt>OpenDocument (application/vnd.oasis.opendocument.*)</dt><dd> Tika uses the built-in ZIP and XML features in Java to parse the <a class="externalLink" href="http://en.wikipedia.org/wiki/OpenDocument">OpenDocument</a> document types used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0 formats are also supported, though they are currently not auto-detected as well as the newer formats.<p>Support for the OpenDocument formats was add
 ed in Tika 0.3.</p></dd><dt>Plain text (text/plain)</dt><dd> Tika uses the <a class="externalLink" href="http://www.icu-project.org/">International Components for Unicode</a> Java library (ICU4J) to parse plain text. Support for plain text was added in Tika 0.1.<p>Extracting text content from plain text files is actually a relatively complex task due to the fact that the character encoding of the text file is often unknown to the parser.</p><p>The text parser in Tika uses the ICU4J <a class="externalLink" href="http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html">CharsetDetector</a> class to automatically detect the character encoding of any text input. As an added benefit, the ICU4J library is in some cases able to detect also the language in which the text is written.</p><p>The character encoding and language of the plain text document are returned as the <tt>Metadata.CONTENT_ENCODING</tt> and <tt>Metadata.LANGUAGE</tt> metadata properties. If the (declar
 ed) content encoding of a text document is already known to the client application, then it can be supplied as the <tt>Metadata.CONTENT_ENCODING</tt> metadata property to the parser to simplify encoding detection.</p></dd><dt>Portable Document Format (application/pdf)</dt><dd> Tika uses the <a class="externalLink" href="http://www.pdfbox.org">PDFBox</a> library to parse Portable Document Format (PDF) documents.<p>Support for PDF was added in Tika 0.1.</p></dd><dt>Rich Text Format (application/rtf)</dt><dd> Tika uses Java's built-in Swing library to parse Rich Text Format (RTF) documents. Support for RTF was added in Tika 0.1.<p>The RTF parser in Tika uses the Swing <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html">RTFEditorKit</a> class to extract all text from an RTF document as a single paragraph. Document metadata extraction is currently not supported.</p></dd><dt>tar archive (application/x-tar)</dt><dd> Tika uses an ada
 pted version of the tar parsing code from <a class="externalLink" href="http://ant.apache.org/">Apache Ant</a> to parse tar archives. The tar code is originally based on work by Timothy Gerard Endres.<p>Support for tar archives was added in Tika 0.2.</p></dd><dt>ZIP archive (application/zip)</dt><dd> Tika uses Java's built-in Zip classes to parse ZIP files.<p>Support for ZIP was added in Tika 0.2.</p></dd></dl></div></div>
+        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section">
+<h2>Supported Document Formats<a name="Supported_Document_Formats"></a></h2>
+<p>This page lists all the document formats supported by Apache Tika.</p>
+<div class="section">
+<h3>Microsoft's OLE 2 Compound Document format<a name="Microsofts_OLE_2_Compound_Document_format"></a></h3>
+<p>A number of Microsoft applications, most notably the Microsoft Office suite, use the generic OLE 2 Compound Document format as the basis of their document formats. Tika uses <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> to support a number of these formats.</p>
+<p>The OLE2 Compound Document format is designed for use with random access files, and so the input stream passed to a Tika parser needs to be spooled in memory or in a temporary file depending on the size of the document. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for an effort to avoid this extra temporary file if the input document already comes from a file.</p>
+<p>In addition to the shared base format there is also a shared set of metadata in typical OLE2 documents. Tika uses the <a class="externalLink" href="http://poi.apache.org/hpsf/">HPSF library</a> from POI to parse these property sets and exposes them as the following document metadata:</p>
+<ul>
+<li><tt>TITLE</tt> Title</li>
+<li><tt>SUBJECT</tt> Subject</li>
+<li><tt>AUTHOR</tt> Author</li>
+<li><tt>KEYWORDS</tt> Keywords</li>
+<li><tt>COMMENTS</tt> Comments</li>
+<li><tt>TEMPLATE</tt> Template</li>
+<li><tt>LAST_AUTHOR</tt> Last Saved By</li>
+<li><tt>REVISION_NUMBER</tt> Revision Number</li>
+<li><tt>LAST_PRINTED</tt> Last Printed</li>
+<li><tt>LAST_SAVED</tt> Last Saved Time/Date</li>
+<li><tt>PAGE_COUNT</tt> Number of Pages</li>
+<li><tt>WORD_COUNT</tt> Number of Words</li>
+<li><tt>CHARACTER_COUNT</tt> Number of Characters</li>
+<li><tt>APPLICATION_NAME</tt> Name of Creating Application</li></ul>
+<p>Note that in practice the metadata in many documents is either missing, incomplete or even incorrect, so a client application should not rely too much on this information.</p>
+<p>Support for the new Office Open XML format used by Microsoft Office version 2007 is pending a POI upgrade. The current status is tracked in <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-152">TIKA-152</a>.</p>
+<p>The generic OLE2 Compound Document format is automatically detected using a magic number, and further parsing can automatically determine the more specific document format. Tika also knows a number of common glob patterns like <tt>*.doc</tt> and <tt>*.ppt</tt> for these formats.</p>
+<p>The supported OLE 2 Compound Document formats are:</p>
+<dl>
+<dt>Microsoft Excel (application/vnd.ms-excel)</dt>
+<dd> Excel spreadsheet support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hssf/">HSSF library</a> from POI.
+<p>The Excel parser in Tika uses the <a class="externalLink" href="http://poi.apache.org/hssf/how-to.html#event_api">HSSF event API</a> and is able to extract much of the document structure, including all (non-empty) worksheets and their table structures. Formula results are extracted as stored in the Excel file, and cell links are exposed as XHTML links. These features were added in Tika version 0.2.</p>
+<p>Cell comments and formatting are currently not supported. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-148">TIKA-148</a> and <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-103">TIKA-103</a> for the respective issues.</p></dd>
+<dt>Microsoft Word (application/msword)</dt>
+<dd> Word document support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hwpf/">HWPF library</a> from POI.
+<p>The Word parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html">WordExtractor</a> class from HWPF to extract document content as a sequence of paragraphs.</p></dd>
+<dt>Microsoft PowerPoint (application/vnd.ms-powerpoint)</dt>
+<dd> PowerPoint presentation support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hslf/">HSLF library</a> from POI.
+<p>The PowerPoint parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html">PowerPointExtractor</a> class from HSLF to extract slide content as a single paragraph.</p></dd>
+<dt>Microsoft Visio (application/vnd.visio)</dt>
+<dd> Visio diagram support was added in Tika version 0.2 and is based on the <a class="externalLink" href="http://poi.apache.org/hdgf/">HDGF library</a> from POI.
+<p>The Visio parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html">VisioTextExtractor</a> class from HDGF to extract diagram content as a sequence of paragraphs.</p></dd>
+<dt>Microsoft Outlook (application/vnd.ms-outlook)</dt>
+<dd> Outlook message support was added in Tika version 0.2 and is based on the <a class="externalLink" href="http://poi.apache.org/hsmf/">HSMF library</a> from POI.
+<p>The Outlook parser extracts the subject of the message and the From, To, Cc, and Bcc addresses (formatted for display) along with the body text of text/plain messages. The <tt>AUTHOR</tt>, <tt>TITLE</tt> and <tt>SUBJECT</tt> metadata properties are set explicitly, overriding potential generic document metadata retrieved from OLE2 property sets.</p></dd></dl></div>
+<div class="section">
+<h3>Compression formats<a name="Compression_formats"></a></h3>
+<p>General purpose compression formats are used to reduce the size of any kind of document. Tika uses a parsing pipeline to support general purpose compression: in the first stage the compressed stream is decompressed, and the resulting decompressed stream is passed on to a second parsing stage where it is processed as if the document had never been compressed.</p>
+<p>Tika contains magic numbers and glob patterns for auto-detecting all supported compression formats. The glob patterns of compression formats are also used to determine the name of the original uncompressed document. If a client application has supplied a <tt>RESOURCE_NAME_KEY</tt> metadata property that matches such a glob pattern, then the decompressing first parsing stage will replace the <tt>RESOURCE_NAME_KEY</tt> metadata property with the deduced original document name before passing control to the second parsing stage.</p>
+<p>Note that apart from the special handling of the <tt>RESOURCE_NAME_KEY</tt> property, no document metadata is passed to or from the second parsing stage. Only the text content extracted by the second stage parser is returned to the client application.</p>
+<p>The supported compression formats are:</p>
+<dl>
+<dt>gzip compression (application/x-gzip)</dt>
+<dd> <a class="externalLink" href="http://en.wikipedia.org/wiki/Gzip">Gzip</a> support was added in Tika version 0.2 and is based on the <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html">GZIPInputStream</a> class in the Java 5 class library.
+<p>The known gzip glob patterns are <tt>*.tgz</tt>, <tt>*.gz</tt> and <tt>*-gz</tt>, and they will respectively be replaced with <tt>*.tar</tt>, <tt>*</tt> and <tt>*</tt> as described above.</p></dd>
+<dt>bzip2 compression (application/x-bzip)</dt>
+<dd> <a class="externalLink" href="http://en.wikipedia.org/wiki/Bzip2">Bzip2</a> support was added in Tika version 0.2 and is based on bzip2 parsing code from <a class="externalLink" href="http://ant.apache.org/">Apache Ant</a>, which in turn was originally based on work by Keiron Liddle from Aftex Software.
+<p>The known bzip2 glob patterns are <tt>*.tbz</tt>, <tt>*.tbz2</tt>, <tt>*.bz</tt> and <tt>*.bz2</tt>, and they will respectively be replaced with <tt>*.tar</tt>, <tt>*.tar</tt>, <tt>*</tt> and <tt>*</tt> as described above.</p></dd></dl></div>
+<div class="section">
+<h3>Audio formats<a name="Audio_formats"></a></h3>
+<p>Tika can detect several common audio formats and extract metadata from them. Text extraction is supported for some MIDI-based karaoke formats that contain the lyrics of the encoded audio.</p>
+<p>See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-94">TIKA-94</a> for an effort to integrate speech recognition support into Tika.</p>
+<dl>
+<dt>MP3 Audio (audio/mpeg)</dt>
+<dd> The parsing of <a class="externalLink" href="http://www.id3.org/ID3v1">ID3v1</a> tags from MP3 files was added in Tika version 0.2. If found, the following metadata is extracted and set:
+<ul>
+<li><tt>TITLE</tt> Title</li>
+<li><tt>SUBJECT</tt> Subject</li></ul>
+<p>The above information, as well as the <tt>Album</tt>, <tt>Track</tt>, <tt>Year</tt>, <tt>Genre</tt> and additional <tt>Comment</tt> are extracted when set in the file.</p></dd>
+<dt>MIDI audio (audio/midi)</dt>
+<dd> Tika uses the MIDI support in <tt>javax.sound.midi</tt> to parse MIDI sequence files. Many karaoke file formats are based on MIDI and contain lyrics as embedded text tracks that Tika knows how to extract.
+<p>Support for MIDI files was added in Tika 0.3.</p></dd>
+<dt>Wave audio (audio/basic)</dt>
+<dd> Tika supports sampled wave audio (.wav files, etc.) using the <tt>javax.sound.sampled</tt> package. Only sampling metadata is extracted.
+<p>Support for sampled wave audio was added in Tika 0.3. </p></dd></dl></div>
+<div class="section">
+<h3>Other supported formats<a name="Other_supported_formats"></a></h3>
+<dl>
+<dt>Extensible Markup Language (application/xml)</dt>
+<dd> Tika uses the <tt>javax.xml</tt> classes to parse Extensible Markup Language files. Support for Extensible Markup Language files was added in Tika 0.1.</dd>
+<dt>HyperText Markup Language (text/html)</dt>
+<dd> Tika uses the <a class="externalLink" href="http://sourceforge.net/projects/nekohtml">CyberNeko</a> library to parse HyperText Markup Language files. Support for HyperText Markup Language files was added in Tika 0.1.</dd>
+<dt>Images (image/*)</dt>
+<dd> Tika uses the <tt>javax.imageio</tt> classes to extract metadata from image files.
+<p>Support for Image files was added in Tika 0.2.</p></dd>
+<dt>Java class files</dt>
+<dd> The parsing of Java Class files is based on the asm library and work by Dave Brosius in JCR-1522.
+<p>Support for Java Class files was added in Tika 0.2.</p></dd>
+<dt>Java jar archives</dt>
+<dd> The parsing of Java JAR archives is performed using a combination of the ZIP and Java class file parsers.
+<p>Support for Java JAR archives was added in Tika 0.2.</p></dd>
+<dt>OpenDocument (application/vnd.oasis.opendocument.*)</dt>
+<dd> Tika uses the built-in ZIP and XML features in Java to parse the <a class="externalLink" href="http://en.wikipedia.org/wiki/OpenDocument">OpenDocument</a> document types used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0 formats are also supported, though they are currently not auto-detected as well as the newer formats.
+<p>Support for the OpenDocument formats was added in Tika 0.3.</p></dd>
+<dt>Plain text (text/plain)</dt>
+<dd> Tika uses the <a class="externalLink" href="http://www.icu-project.org/">International Components for Unicode</a> Java library (ICU4J) to parse plain text. Support for plain text was added in Tika 0.1.
+<p>Extracting text content from plain text files is actually a relatively complex task because the character encoding of the text file is often unknown to the parser.</p>
+<p>The text parser in Tika uses the ICU4J <a class="externalLink" href="http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html">CharsetDetector</a> class to automatically detect the character encoding of any text input. As an added benefit, the ICU4J library can in some cases also detect the language in which the text is written.</p>
+<p>The character encoding and language of the plain text document are returned as the <tt>Metadata.CONTENT_ENCODING</tt> and <tt>Metadata.LANGUAGE</tt> metadata properties. If the (declared) content encoding of a text document is already known to the client application, then it can be supplied as the <tt>Metadata.CONTENT_ENCODING</tt> metadata property to the parser to simplify encoding detection.</p></dd>
+<dt>Portable Document Format (application/pdf)</dt>
+<dd> Tika uses the <a class="externalLink" href="http://www.pdfbox.org">PDFBox</a> library to parse Portable Document Format (PDF) documents.
+<p>Support for PDF was added in Tika 0.1.</p></dd>
+<dt>Rich Text Format (application/rtf)</dt>
+<dd> Tika uses Java's built-in Swing library to parse Rich Text Format (RTF) documents. Support for RTF was added in Tika 0.1.
+<p>The RTF parser in Tika uses the Swing <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html">RTFEditorKit</a> class to extract all text from an RTF document as a single paragraph. Document metadata extraction is currently not supported.</p></dd>
+<dt>tar archive (application/x-tar)</dt>
+<dd> Tika uses an adapted version of the tar parsing code from <a class="externalLink" href="http://ant.apache.org/">Apache Ant</a> to parse tar archives. The tar code is originally based on work by Timothy Gerard Endres.
+<p>Support for tar archives was added in Tika 0.2.</p></dd>
+<dt>ZIP archive (application/zip)</dt>
+<dd> Tika uses Java's built-in Zip classes to parse ZIP files.
+<p>Support for ZIP was added in Tika 0.2.</p></dd></dl></div></div>
       </div>
       <div id="sidebar">
         <div id="navigation">
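The "Compression formats" section of formats.html above describes a two-stage pipeline: the first stage decompresses the stream and rewrites the resource name according to the glob patterns (*.tgz to *.tar, *.gz and *-gz to the bare name), and the second stage parses the result as if it had never been compressed. A minimal sketch of that first stage for gzip, using only the Java class library, might look as follows. The class GzipStage and its methods are hypothetical names chosen for this example and do not reflect Tika's own parser classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative first parsing stage for gzip input (hypothetical helper). */
final class GzipStage {

    /** Deduces the original document name from a compressed resource name. */
    static String originalName(String name) {
        if (name.endsWith(".tgz")) {
            // *.tgz is replaced with *.tar
            return name.substring(0, name.length() - 4) + ".tar";
        }
        if (name.endsWith(".gz") || name.endsWith("-gz")) {
            // *.gz and *-gz are replaced with *
            return name.substring(0, name.length() - 3);
        }
        return name; // no known gzip pattern matched
    }

    /** Decompresses the whole stream so a second stage can parse the bytes. */
    static byte[] decompress(InputStream compressed) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(compressed);
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Test helper: gzip-compresses a byte array in memory. */
    static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] packed = compress("hello".getBytes(StandardCharsets.UTF_8));
        byte[] plain = decompress(new ByteArrayInputStream(packed));
        System.out.println(originalName("notes.txt.gz") + ": "
                + new String(plain, StandardCharsets.UTF_8));
    }
}
```

A real pipeline would stream the decompressed bytes into the second parser rather than buffering them in memory; buffering keeps the sketch short.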

Modified: tika/site/publish/0.5/gettingstarted.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/gettingstarted.html?rev=1582236&r1=1582235&r2=1582236&view=diff
==============================================================================
--- tika/site/publish/0.5/gettingstarted.html (original)
+++ tika/site/publish/0.5/gettingstarted.html Thu Mar 27 09:46:52 2014
@@ -45,7 +45,7 @@
           }
         }
         if (provider == "lucid") {
-          form.action = "http://search.lucidimagination.com/p:tika";
+          form.action = "http://find.searchhub.org/p:tika";
         } else if (provider == "sl") {
           form.action = "http://search-lucene.com/tika";
         }
@@ -84,15 +84,48 @@
                 width="387" height="100"/></a>
       </div>
       <div id="content">
-        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Getting Started with Apache Tika<a name="Getting_Started_with_Apache_Tika"></a></h2><p>This document 
 describes how to build Apache Tika from sources and how to start using Tika in an application.</p></div><div class="section"><h2>Getting and building the sources<a name="Getting_and_building_the_sources"></a></h2><p>To build Tika from sources you first need to either <a href="../download.html">download</a> a source release or <a href="../source-repository.html">checkout</a> the latest sources from version control.</p><p>Once you have the sources, you can build them using the <a class="externalLink" href="http://maven.apache.org/">Maven 2</a> build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.</p><div><pre>mvn install</pre></div><p>See the Maven documentation for more information about the available build options.</p><p>Note that you need Java 5 or higher to build Tika.</p></div><div class="section"><h2>Build artifacts<a name="Build_artifacts"></a></h2><p>Starting with Tika 0.5,
  the build consists of a number of components and produces the following main binaries (x.y stands for the current Tika version number):</p><dl><dt>tika-core/target/tika-core-x.y.jar</dt><dd> Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.</dd><dt>tika-core/target/tika-core-x.y-jdk14.jar</dt><dd> Java 1.4 version of the Tika core library.</dd><dt>tika-parsers/target/tika-parsers-x.y.jar</dt><dd> Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.</dd><dt>tika-app/target/tika-app-x.y.jar</dt><dd> Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.</dd></dl></div><div class="section"><h2>Using Tika as a Maven dependency<a name="Using_Tika_as_a_Maven_dependency"></a></h2><p>Since the 0.5 release Tika has been split to components to giv
 e you more control over which parts of Tika you want to use in your application. The core library, tika-core, contains the key interfaces and classes, so you'll always want to include a dependency to it:</p><div><pre>  &lt;dependency&gt;
+        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section">
+<h2>Getting Started with Apache Tika<a name="Getting_Started_with_Apache_Tika"></a></h2>
+<p>This document describes how to build Apache Tika from sources and how to start using Tika in an application.</p></div>
+<div class="section">
+<h2>Getting and building the sources<a name="Getting_and_building_the_sources"></a></h2>
+<p>To build Tika from sources you first need to either <a href="../download.html">download</a> a source release or <a href="../source-repository.html">checkout</a> the latest sources from version control.</p>
+<p>Once you have the sources, you can build them using the <a class="externalLink" href="http://maven.apache.org/">Maven 2</a> build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.</p>
+<div>
+<pre>mvn install</pre></div>
+<p>See the Maven documentation for more information about the available build options.</p>
+<p>Note that you need Java 5 or higher to build Tika.</p></div>
+<div class="section">
+<h2>Build artifacts<a name="Build_artifacts"></a></h2>
+<p>Starting with Tika 0.5, the build consists of a number of components and produces the following main binaries (x.y stands for the current Tika version number):</p>
+<dl>
+<dt>tika-core/target/tika-core-x.y.jar</dt>
+<dd> Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.</dd>
+<dt>tika-core/target/tika-core-x.y-jdk14.jar</dt>
+<dd> Java 1.4 version of the Tika core library.</dd>
+<dt>tika-parsers/target/tika-parsers-x.y.jar</dt>
+<dd> Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.</dd>
+<dt>tika-app/target/tika-app-x.y.jar</dt>
+<dd> Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.</dd></dl></div>
+<div class="section">
+<h2>Using Tika as a Maven dependency<a name="Using_Tika_as_a_Maven_dependency"></a></h2>
+<p>Since the 0.5 release Tika has been split to components to give you more control over which parts of Tika you want to use in your application. The core library, tika-core, contains the key interfaces and classes, so you'll always want to include a dependency to it:</p>
+<div>
+<pre>  &lt;dependency&gt;
     &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;
     &lt;artifactId&gt;tika-core&lt;/artifactId&gt;
     &lt;version&gt;x.y&lt;/version&gt;  &lt;!-- 0.5 or higher --&gt;
-  &lt;/dependency&gt;</pre></div><p>This dependency only gives you basic Tika functionality without any of the parser libraries. If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you also need the tika-parsers dependency: </p><div><pre>  &lt;dependency&gt;
+  &lt;/dependency&gt;</pre></div>
+<p>This dependency only gives you basic Tika functionality without any of the parser libraries. If you want to use Tika to parse documents (instead of simply detecting document types, etc.), you also need the tika-parsers dependency: </p>
+<div>
+<pre>  &lt;dependency&gt;
     &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;
     &lt;artifactId&gt;tika-parsers&lt;/artifactId&gt;
     &lt;version&gt;x.y&lt;/version&gt;  &lt;!-- same version as in tika-core --&gt;
-  &lt;/dependency&gt;</pre></div><p>Note that adding this dependency will introduce a number of transitive dependencies to your project. You need to make sure that these dependencies won't conflict with your existing project dependencies. The listing below shows all the compile-scope dependencies of the current Tika parsers release (0.5, November 2009). You can use the command &quot;mvn dependency:tree&quot; to check the latest tree of dependencies on any one of Tika's core, parsers and app projects.</p><div><pre>org.apache.tika:tika-parent:pom:0.5
+  &lt;/dependency&gt;</pre></div>
+<p>Note that adding this dependency will introduce a number of transitive dependencies to your project. You need to make sure that these dependencies won't conflict with your existing project dependencies. The listing below shows all the compile-scope dependencies of the current Tika parsers release (0.5, November 2009). You can use the command &quot;mvn dependency:tree&quot; to check the latest tree of dependencies on any one of Tika's core, parsers and app projects.</p>
+<div>
+<pre>org.apache.tika:tika-parent:pom:0.5
 org.apache.tika:tika-core:bundle:0.5
 \- junit:junit:jar:3.8.1:test
 org.apache.tika:tika-parsers:bundle:0.5
@@ -137,7 +170,12 @@ org.apache.tika:tika-app:bundle:0.5
    +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2:provided
    +- asm:asm:jar:3.1:provided
    +- log4j:log4j:jar:1.2.14:provided
-   \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided</pre></div></div><div class="section"><h2>Using Tika in an Ant project<a name="Using_Tika_in_an_Ant_project"></a></h2><p>Unless you use a dependency manager tool like <a class="externalLink" href="http://ant.apache.org/ivy/">Apache Ivy</a>, to use Tika in your application you can include the Tika jar files and the dependencies individually.</p><div><pre>&lt;classpath&gt;
+   \- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:provided</pre></div></div>
+<div class="section">
+<h2>Using Tika in an Ant project<a name="Using_Tika_in_an_Ant_project"></a></h2>
+<p>Unless you use a dependency manager tool like <a class="externalLink" href="http://ant.apache.org/ivy/">Apache Ivy</a>, to use Tika in your application you can include the Tika jar files and the dependencies individually.</p>
+<div>
+<pre>&lt;classpath&gt;
   ... &lt;!-- your other classpath entries --&gt;
   &lt;pathelement location=&quot;path/to/tika-core-0.5.jar&quot;/&gt;
   &lt;pathelement location=&quot;path/to/tika-parsers-0.5.jar&quot;/&gt;
@@ -160,7 +198,15 @@ org.apache.tika:tika-app:bundle:0.5
   &lt;pathelement location=&quot;path/to/geronimo-stax-api_1.0_spec-1.0.jar&quot;/&gt;
   &lt;pathelement location=&quot;path/to/asm-3.1.jar&quot;/&gt;
   &lt;pathelement location=&quot;path/to/log4j-1.2.14.jar&quot;/&gt;
-&lt;/classpath&gt;</pre></div><p>An easy way to gather all these libraries is to run &quot;mvn dependency:copy-dependencies&quot; in the Tika source directory. This will copy all Tika dependencies to the <tt>target/dependencies</tt> directory.</p><p>Alternatively you can simply drop the entire tika-app jar to your classpath to get all of the above dependencies in a single archive.</p></div><div class="section"><h2>Using Tika as a command line utility<a name="Using_Tika_as_a_command_line_utility"></a></h2><p>The Tika application jar (tika-app-x.y.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don't need to worry about classpath settings to run it.</p><p>The usage instructions are shown below.</p><div><pre>usage: java -jar tika-app-x.y.jar [option] [file]
+&lt;/classpath&gt;</pre></div>
+<p>An easy way to gather all these libraries is to run &quot;mvn dependency:copy-dependencies&quot; in the Tika source directory. This will copy all Tika dependencies to the <tt>target/dependencies</tt> directory.</p>
+<p>Alternatively you can simply drop the entire tika-app jar to your classpath to get all of the above dependencies in a single archive.</p></div>
+<div class="section">
+<h2>Using Tika as a command line utility<a name="Using_Tika_as_a_command_line_utility"></a></h2>
+<p>The Tika application jar (tika-app-x.y.jar) can be used as a command line utility for extracting text content and metadata from all sorts of files. This runnable jar contains all the dependencies it needs, so you don't need to worry about classpath settings to run it.</p>
+<p>The usage instructions are shown below.</p>
+<div>
+<pre>usage: java -jar tika-app-x.y.jar [option] [file]
 
 Options:
     -? or --help       Print this usage message
@@ -186,7 +232,10 @@ Description:
     Use the &quot;--gui&quot; (or &quot;-g&quot;) option to start
     the Apache Tika GUI. You can drag and drop files
     from a normal file explorer to the GUI window to
-    extract text content and metadata from the files.</pre></div><p>You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages.</p><div><pre># Check if an Internet resource contains a specific keyword
+    extract text content and metadata from the files.</pre></div>
+<p>You can also use the jar as a component in a Unix pipeline or as an external tool in many scripting languages.</p>
+<div>
+<pre># Check if an Internet resource contains a specific keyword
 curl http://.../document.doc \
   | java -jar tika-app-x.y.jar --text \
   | grep -q keyword</pre></div></div>

Modified: tika/site/publish/0.5/index.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/index.html?rev=1582236&r1=1582235&r2=1582236&view=diff
==============================================================================
--- tika/site/publish/0.5/index.html (original)
+++ tika/site/publish/0.5/index.html Thu Mar 27 09:46:52 2014
@@ -45,7 +45,7 @@
           }
         }
         if (provider == "lucid") {
-          form.action = "http://search.lucidimagination.com/p:tika";
+          form.action = "http://find.searchhub.org/p:tika";
         } else if (provider == "sl") {
           form.action = "http://search-lucene.com/tika";
         }
@@ -84,7 +84,41 @@
                 width="387" height="100"/></a>
       </div>
       <div id="content">
-        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Apache Tika 0.5<a name="Apache_Tika_0.5"></a></h2><p>The most notable changes in Tika 0.5 over the previous release are:</p><ul><li>Improved RDF/OWL mime detection using both MIME magic as well as pattern matching. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-309">TIKA-309</a>)</li><li>An org.apache.tika.Tika facade class has been added to simplify common text extraction and type detection use cases. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-269">TIKA-269</a>)</li><li>A new parse context argument was added to the Parser.parse() method. This context map can be used to pass things like a delegate parser or other settings to the parsing process. The previous parse() method signature has been deprecated and will be removed in Tika 1.0. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-275">TIKA-275</a>)</li><li>A simple ngram-based language detection mechanism has been added along with predefined language profiles for 18 languages. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-209">TIKA-209</a>)</li><li>The media type registry in Tika was synchronized with the MIME type configuration in the Apache HTTP Server. Tika now knows about 1274 different media types and can detect 672 of those using 927 file extension and 280 magic byte patterns. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-285">TIKA-285</a>)</li><li>Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing PDF documents. This version is notably better than the 0.7.3 release used earlier. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-158">TIKA-158</a>)</li></ul><p>The following people have contributed to Tika 0.5 by submitting or commenting on the issues resolved in this release:</p><ul><li>Alex Baranov</li><li>Bart Hanssens</li><li>Benson Margulies</li><li>Chris A. Mattmann</li><li>Daan de Wit</li><li>Erik Hetzner</li><li>Frank Hellwig</li><li>Jeff Cadow</li><li>Joachim Zittmayr</li><li>Jukka Zitting </li><li>Julien Nioche</li><li>Ken Krugler</li><li>Maxim Valyanskiy</li><li>MRIT64</li><li>Paul Borgermans</li><li>Piotr B.</li><li>Robert Newson</li><li>Sascha Szott</li><li>Ted Dunning</li><li>Thilo Goetz</li><li>Uwe Schindler</li><li>Yuan-Fang Li</li></ul><p>See <a class="externalLink" href="http://tinyurl.com/yl9prwp">http://tinyurl.com/yl9prwp</a> for more details on these contributions.</p></div>
+        <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section">
+<h2>Apache Tika 0.5<a name="Apache_Tika_0.5"></a></h2>
+<p>The most notable changes in Tika 0.5 over the previous release are:</p>
+<ul>
+<li>Improved RDF/OWL mime detection using both MIME magic as well as pattern matching. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-309">TIKA-309</a>)</li>
+<li>An org.apache.tika.Tika facade class has been added to simplify common text extraction and type detection use cases. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-269">TIKA-269</a>)</li>
+<li>A new parse context argument was added to the Parser.parse() method. This context map can be used to pass things like a delegate parser or other settings to the parsing process. The previous parse() method signature has been deprecated and will be removed in Tika 1.0. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-275">TIKA-275</a>)</li>
+<li>A simple ngram-based language detection mechanism has been added along with predefined language profiles for 18 languages. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-209">TIKA-209</a>)</li>
+<li>The media type registry in Tika was synchronized with the MIME type configuration in the Apache HTTP Server. Tika now knows about 1274 different media types and can detect 672 of those using 927 file extension and 280 magic byte patterns. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-285">TIKA-285</a>)</li>
+<li>Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing PDF documents. This version is notably better than the 0.7.3 release used earlier. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-158">TIKA-158</a>)</li></ul>
+<p>The following people have contributed to Tika 0.5 by submitting or commenting on the issues resolved in this release:</p>
+<ul>
+<li>Alex Baranov</li>
+<li>Bart Hanssens</li>
+<li>Benson Margulies</li>
+<li>Chris A. Mattmann</li>
+<li>Daan de Wit</li>
+<li>Erik Hetzner</li>
+<li>Frank Hellwig</li>
+<li>Jeff Cadow</li>
+<li>Joachim Zittmayr</li>
+<li>Jukka Zitting </li>
+<li>Julien Nioche</li>
+<li>Ken Krugler</li>
+<li>Maxim Valyanskiy</li>
+<li>MRIT64</li>
+<li>Paul Borgermans</li>
+<li>Piotr B.</li>
+<li>Robert Newson</li>
+<li>Sascha Szott</li>
+<li>Ted Dunning</li>
+<li>Thilo Goetz</li>
+<li>Uwe Schindler</li>
+<li>Yuan-Fang Li</li></ul>
+<p>See <a class="externalLink" href="http://tinyurl.com/yl9prwp">http://tinyurl.com/yl9prwp</a> for more details on these contributions.</p></div>
       </div>
       <div id="sidebar">
         <div id="navigation">