Posted to commits@tika.apache.org by ju...@apache.org on 2010/11/01 00:33:49 UTC

svn commit: r1029517 [1/2] - in /tika/site: publish/ publish/0.5/ publish/0.6/ publish/0.7/ src/site/

Author: jukka
Date: Sun Oct 31 23:33:48 2010
New Revision: 1029517

URL: http://svn.apache.org/viewvc?rev=1029517&view=rev
Log:
TIKA-536: Updated site layout

Inline the banner images to keep the links absolute. Also auto-focus the search form.

Modified:
    tika/site/publish/0.5/documentation.html
    tika/site/publish/0.5/formats.html
    tika/site/publish/0.5/gettingstarted.html
    tika/site/publish/0.5/index.html
    tika/site/publish/0.6/formats.html
    tika/site/publish/0.6/gettingstarted.html
    tika/site/publish/0.6/index.html
    tika/site/publish/0.6/parser.html
    tika/site/publish/0.7/detection.html
    tika/site/publish/0.7/formats.html
    tika/site/publish/0.7/gettingstarted.html
    tika/site/publish/0.7/index.html
    tika/site/publish/0.7/parser.html
    tika/site/publish/0.7/parser_guide.html
    tika/site/publish/download.html
    tika/site/publish/index.html
    tika/site/publish/mail-lists.html
    tika/site/src/site/site.vm

Modified: tika/site/publish/0.5/documentation.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/documentation.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.5/documentation.html (original)
+++ tika/site/publish/0.5/documentation.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Apache Tika Documentation<a name="Apache_Tika_Documentation"></a></h2><p>This document descri
 bes the key abstractions and usage of Apache Tika.</p></div><div class="section"><h2>The Parser interface<a name="The_Parser_interface"></a></h2><p>The <a href="./api/org/apache/tika/parser/Parser.html">org.apache.tika.parser.Parser</a> interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:</p><div><pre>void parse(InputStream stream, ContentHandler handler, Metadata metadata)
     throws IOException, SAXException, TikaException;</pre></div><p>The <tt>parse</tt> method takes the document to be parsed and related metadata as input and outputs the results as XHTML SAX events and extra metadata. The main criteria that led to this design were:</p><dl><dt>Streamed parsing</dt><dd>The interface should require neither the client application nor the parser implementation to keep the full document content in memory or spooled to disk. This allows even huge documents to be parsed without excessive resource requirements.</dd><dt>Structured content</dt><dd>A parser implementation should be able to include structural information (headings, links, etc.) in the extracted content. A client application can use this information for example to better judge the relevance of different parts of the parsed document.</dd><dt>Input metadata</dt><dd>A client application should be able to include metadata like the file name or declared content type with the document to be
 parsed. The parser implementation can use this information to better guide the parsing process.</dd><dt>Output metadata</dt><dd>A parser implementation should be able to return document metadata in addition to document content. Many document formats contain metadata like the name of the author that may be useful to client applications.</dd></dl><p>These criteria are reflected in the arguments of the <tt>parse</tt> method.</p></div><div class="section"><h2>Document input stream<a name="Document_input_stream"></a></h2><p>The first argument is an <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html">InputStream</a> for reading the document to be parsed.</p><p>If this document stream cannot be read, then parsing stops and the thrown <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html">IOException</a> is passed up to the client application. If the stream can be read but not parsed (for example if
  the document is corrupted), then the parser throws a <a href="./api/org/apache/tika/exception/TikaException.html">TikaException</a>.</p><p>The parser implementation will consume this stream but <i>will not close it</i>. Closing the stream is the responsibility of the client application that opened it in the first place. The recommended pattern for using streams with the <tt>parse</tt> method is:</p><div><pre>InputStream stream = ...;      // open the stream

Modified: tika/site/publish/0.5/formats.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/formats.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.5/formats.html (original)
+++ tika/site/publish/0.5/formats.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Supported Document Formats<a name="Supported_Document_Formats"></a></h2><p>This page lists all
 the document formats supported by Apache Tika.</p><div class="section"><h3>Microsoft's OLE 2 Compound Document format<a name="Microsofts_OLE_2_Compound_Document_format"></a></h3><p>A number of Microsoft applications, most notably the Microsoft Office suite, use the generic OLE 2 Compound Document format as the basis of their document formats. Tika uses <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> to support a number of these formats.</p><p>The OLE2 Compound Document format is designed for use with random access files, and so the input stream passed to a Tika parser needs to be spooled in memory or in a temporary file depending on the size of the document. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-153">TIKA-153</a> for an effort to avoid this extra temporary file if the input document already comes from a file.</p><p>In addition to the shared base format there's also a shared set of metadata in typical OLE2
 documents. Tika uses the <a class="externalLink" href="http://poi.apache.org/hpsf/">HPSF library</a> from POI to parse these property sets and exposes them as the following document metadata:</p><ul><li><tt>TITLE</tt> Title</li><li><tt>SUBJECT</tt> Subject</li><li><tt>AUTHOR</tt> Author</li><li><tt>KEYWORDS</tt> Keywords</li><li><tt>COMMENTS</tt> Comments</li><li><tt>TEMPLATE</tt> Template</li><li><tt>LAST_SAVED</tt> Last Saved By</li><li><tt>REVISION_NUMBER</tt> Revision Number</li><li><tt>LAST_PRINTED</tt> Last Printed</li><li><tt>LAST_SAVED</tt> Last Saved Time/Date</li><li><tt>PAGE_COUNT</tt> Number of Pages</li><li><tt>WORD_COUNT</tt> Number of Words</li><li><tt>CHARACTER_COUNT</tt> Number of Characters</li><li><tt>APPLICATION_NAME</tt> Name of Creating Application</li></ul><p>Note that in practice the metadata in many documents is either missing, incomplete or even incorrect, so a client application should not rely too much on 
 this information.</p><p>Support for the new Office Open XML format used by Microsoft Office version 2007 is pending a POI upgrade. The current status is recorded in <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-152">TIKA-152</a>.</p><p>The generic OLE2 Compound Document format is automatically detected using a magic number, and further parsing can automatically determine the more specific document format. Tika also knows a number of common glob patterns like <tt>*.doc</tt> and <tt>*.ppt</tt> for these formats.</p><p>The supported OLE 2 Compound Document formats are:</p><dl><dt>Microsoft Excel (application/vnd.ms-excel)</dt><dd> Excel spreadsheet support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hssf/">HSSF library</a> from POI.<p>The Excel parser in Tika uses the <a class="externalLink" href="http://poi.apache.org/hssf/how-to.html#event_api">HSSF event API</a> and is able to extract
  much of the document structure, including all (non-empty) worksheets and their table structures. Formula results are extracted as stored in the Excel file, and cell links are exposed as XHTML links. These features were added in Tika version 0.2.</p><p>Cell comments and formatting are currently not supported. See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-148">TIKA-148</a> and <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-103">TIKA-103</a> for the respective issues.</p></dd><dt>Microsoft Word (application/msword)</dt><dd> Word document support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hwpf/">HWPF library</a> from POI.<p>The Word parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html">WordExtractor</a> class from HWPF to extract document content as a sequence of paragraphs.</p></dd><dt
 >Microsoft PowerPoint (application/vnd.ms-powerpoint)</dt><dd> PowerPoint presentation support is available in all versions of Tika and is based on the <a class="externalLink" href="http://poi.apache.org/hslf/">HSLF library</a> from POI.<p>The PowerPoint parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html">PowerPointExtractor</a> class from HSLF to extract presentation content as a single paragraph.</p></dd><dt>Microsoft Visio (application/vnd.visio)</dt><dd> Visio diagram support was added in Tika version 0.2 and is based on the <a class="externalLink" href="http://poi.apache.org/hdgf/">HDGF library</a> from POI.<p>The Visio parser uses the <a class="externalLink" href="http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html">VisioTextExtractor</a> class from HDGF to extract diagram content as a sequence of paragraphs.</p></dd><dt>Microsoft Outlook (application/vnd.ms-outlook)</dt><dd>
 Outlook message support was added in Tika version 0.2 and is based on the <a class="externalLink" href="http://poi.apache.org/hsmf/">HSMF library</a> from POI.<p>The Outlook parser extracts the subject of the message and the From, To, Cc, and Bcc addresses (formatted for display) along with the body text of text/plain messages. The <tt>AUTHOR</tt>, <tt>TITLE</tt> and <tt>SUBJECT</tt> metadata properties are set explicitly, overriding potential generic document metadata retrieved from OLE2 property sets.</p></dd></dl></div><div class="section"><h3>Compression formats<a name="Compression_formats"></a></h3><p>General purpose compression formats are used to reduce the size of any kind of document. Tika uses a parsing pipeline to support general purpose compression: in the first stage the compressed stream is decompressed and the resulting decompressed stream is passed on to a second parsing stage where it will be processed as if the document had never been compressed
 .</p><p>Tika contains magic numbers and glob patterns for auto-detecting all supported compression formats. The glob patterns of compression formats are also used to determine the name of the original uncompressed document. If a client application has supplied a <tt>RESOURCE_NAME_KEY</tt> metadata property that matches such a glob pattern, then the decompressing first parsing stage will replace the <tt>RESOURCE_NAME_KEY</tt> metadata property with the deduced original document name before passing control to the second parsing stage.</p><p>Note that apart from the special handling of the <tt>RESOURCE_NAME_KEY</tt> property, no document metadata is passed to or from the second parsing stage. Only the text content extracted by the second stage parser is returned to the client application.</p><p>The supported compression formats are:</p><dl><dt>gzip compression (application/x-gzip)</dt><dd> <a class="externalLink" href="http://en.wikipedia.org/wiki/Gzip">Gzip</a> support was add
 ed in Tika version 0.2 and is based on the <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html">GZIPInputStream</a> class in the Java 5 class library.<p>The known gzip glob patterns are <tt>*.tgz</tt>, <tt>*.gz</tt> and <tt>*-gz</tt>, and they will respectively be replaced with <tt>*.tar</tt>, <tt>*</tt> and <tt>*</tt> as described above.</p></dd><dt>bzip2 compression (application/x-bzip)</dt><dd> <a class="externalLink" href="http://en.wikipedia.org/wiki/Bzip2">Bzip2</a> support was added in Tika version 0.2 and is based on bzip2 parsing code from <a class="externalLink" href="http://ant.apache.org/">Apache Ant</a>, which in turn was originally based on work by Keiron Liddle from Aftex Software.<p>The known bzip2 glob patterns are <tt>*.tbz</tt>, <tt>*.tbz2</tt>, <tt>*.bz</tt> and <tt>*.bz2</tt>, and they will respectively be replaced with <tt>*.tar</tt>, <tt>*.tar</tt>, <tt>*</tt> and <tt>*</tt> as described above.</p></
 dd></dl></div><div class="section"><h3>Audio formats<a name="Audio_formats"></a></h3><p>Tika can detect several common audio formats and extract metadata from them. Text extraction is supported for some MIDI-based karaoke formats that contain the lyrics of the encoded audio.</p><p>See <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-94">TIKA-94</a> for an effort to integrate speech recognition support into Tika.</p><dl><dt>MP3 Audio (audio/mpeg)</dt><dd> The parsing of <a class="externalLink" href="http://www.id3.org/ID3v1">ID3v1</a> tags from MP3 files was added in Tika version 0.2. If found, the following metadata is extracted and set:<ul><li><tt>TITLE</tt> Title</li><li><tt>SUBJECT</tt> Subject</li></ul><p>The above information, as well as the <tt>Album</tt>, <tt>Track</tt>, <tt>Year</tt>, <tt>Genre</tt> and additional <tt>Comment</tt> are extracted when set in the file.</p></dd><dt>MIDI audio (audio/midi)</dt><dd> Tika uses the MIDI support in
 <tt>javax.sound.midi</tt> to parse MIDI sequence files. Many karaoke file formats are based on MIDI, and contain lyrics as embedded text tracks that Tika knows how to extract.<p>Support for MIDI files was added in Tika 0.3.</p></dd><dt>Wave audio (audio/basic)</dt><dd> Tika supports sampled wave audio (.wav files, etc.) using the <tt>javax.sound.sampled</tt> package. Only sampling metadata is extracted.<p>Support for sampled wave audio was added in Tika 0.3. </p></dd></dl></div><div class="section"><h3>Other supported formats<a name="Other_supported_formats"></a></h3><dl><dt>Extensible Markup Language (application/xml)</dt><dd> Tika uses the <tt>javax.xml</tt> classes to parse Extensible Markup Language files. Support for Extensible Markup Language files was added in Tika 0.1.</dd><dt>HyperText Markup Language (text/html)</dt><dd> Tika uses the <a class="externalLink" href="http://sourceforge.net/projects/nekohtml">CyberNeko</a> library to parse HyperText Markup Language files.
 Support for HyperText Markup Language files was added in Tika 0.1.</dd><dt>Images (image/*)</dt><dd> Tika uses the <tt>javax.imageio</tt> classes to extract metadata from image files.<p>Support for Image files was added in Tika 0.2.</p></dd><dt>Java class files</dt><dd> The parsing of Java Class files is based on the asm library and work by Dave Brosius in JCR-1522.<p>Support for Java Class files was added in Tika 0.2.</p></dd><dt>Java jar archives</dt><dd> The parsing of Java JAR archives is performed using a combination of the ZIP and Java class file parsers.<p>Support for Java JAR archives was added in Tika 0.2.</p></dd><dt>OpenDocument (application/vnd.oasis.opendocument.*)</dt><dd> Tika uses the built-in ZIP and XML features in Java to parse the <a class="externalLink" href="http://en.wikipedia.org/wiki/OpenDocument">OpenDocument</a> document types used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0 formats are also supported, though they are currently
 not auto-detected as well as the newer formats.<p>Support for the OpenDocument formats was added in Tika 0.3.</p></dd><dt>Plain text (text/plain)</dt><dd> Tika uses the <a class="externalLink" href="http://www.icu-project.org/">International Components for Unicode</a> Java library (ICU4J) to parse plain text. Support for plain text was added in Tika 0.1.<p>Extracting text content from plain text files is actually a relatively complex task due to the fact that the character encoding of the text file is often unknown to the parser.</p><p>The text parser in Tika uses the ICU4J <a class="externalLink" href="http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html">CharsetDetector</a> class to automatically detect the character encoding of any text input. As an added benefit, the ICU4J library is in some cases also able to detect the language in which the text is written.</p><p>The character encoding and language of the plain text document are returned as
 the <tt>Metadata.CONTENT_ENCODING</tt> and <tt>Metadata.LANGUAGE</tt> metadata properties. If the (declared) content encoding of a text document is already known to the client application, then it can be supplied as the <tt>Metadata.CONTENT_ENCODING</tt> metadata property to the parser to simplify encoding detection.</p></dd><dt>Portable Document Format (application/pdf)</dt><dd> Tika uses the <a class="externalLink" href="http://www.pdfbox.org">PDFBox</a> library to parse Portable Document Format (PDF) documents.<p>Support for PDF was added in Tika 0.1.</p></dd><dt>Rich Text Format (application/rtf)</dt><dd> Tika uses Java's built-in Swing library to parse Rich Text Format (RTF) documents. Support for RTF was added in Tika 0.1.<p>The RTF parser in Tika uses the Swing <a class="externalLink" href="http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html">RTFEditorKit</a> class to extract all text from an RTF document as a single paragraph. Document
 metadata extraction is currently not supported.</p></dd><dt>tar archive (application/x-tar)</dt><dd> Tika uses an adapted version of the tar parsing code from <a class="externalLink" href="http://ant.apache.org/">Apache Ant</a> to parse tar archives. The tar code is originally based on work by Timothy Gerard Endres.<p>Support for tar archives was added in Tika 0.2.</p></dd><dt>ZIP archive (application/zip)</dt><dd> Tika uses Java's built-in Zip classes to parse ZIP files.<p>Support for ZIP was added in Tika 0.2.</p></dd></dl></div></div>
       </div>

Modified: tika/site/publish/0.5/gettingstarted.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/gettingstarted.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.5/gettingstarted.html (original)
+++ tika/site/publish/0.5/gettingstarted.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Getting Started with Apache Tika<a name="Getting_Started_with_Apache_Tika"></a></h2><p>This d
 ocument describes how to build Apache Tika from sources and how to start using Tika in an application.</p></div><div class="section"><h2>Getting and building the sources<a name="Getting_and_building_the_sources"></a></h2><p>To build Tika from sources you first need to either <a href="../download.html">download</a> a source release or <a href="../source-repository.html">checkout</a> the latest sources from version control.</p><p>Once you have the sources, you can build them using the <a class="externalLink" href="http://maven.apache.org/">Maven 2</a> build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.</p><div><pre>mvn install</pre></div><p>See the Maven documentation for more information about the available build options.</p><p>Note that you need Java 5 or higher to build Tika.</p></div><div class="section"><h2>Build artifacts<a name="Build_artifacts"></a></h2><p>Startin
 g with Tika 0.5, the build consists of a number of components and produces the following main binaries (x.y stands for the current Tika version number):</p><dl><dt>tika-core/target/tika-core-x.y.jar</dt><dd> Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.</dd><dt>tika-core/target/tika-core-x.y-jdk14.jar</dt><dd> Java 1.4 version of the Tika core library.</dd><dt>tika-parsers/target/tika-parsers-x.y.jar</dt><dd> Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.</dd><dt>tika-app/target/tika-app-x.y.jar</dt><dd> Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.</dd></dl></div><div class="section"><h2>Using Tika as a Maven dependency<a name="Using_Tika_as_a_Maven_dependency"></a></h2><p>Since the 0.5 release Tika has been sp
 lit to components to give you more control over which parts of Tika you want to use in your application. The core library, tika-core, contains the key interfaces and classes, so you'll always want to include a dependency to it:</p><div><pre>  &lt;dependency&gt;
     &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;

Modified: tika/site/publish/0.5/index.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.5/index.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.5/index.html (original)
+++ tika/site/publish/0.5/index.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Apache Tika 0.5<a name="Apache_Tika_0.5"></a></h2><p>The most notable changes in Tika 0.5 ove
 r the previous release are:</p><ul><li>Improved RDF/OWL mime detection using both MIME magic as well as pattern matching. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-309">TIKA-309</a>)</li><li>An org.apache.tika.Tika facade class has been added to simplify common text extraction and type detection use cases. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-269">TIKA-269</a>)</li><li>A new parse context argument was added to the Parser.parse() method. This context map can be used to pass things like a delegate parser or other settings to the parsing process. The previous parse() method signature has been deprecated and will be removed in Tika 1.0. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-275">TIKA-275</a>)</li><li>A simple ngram-based language detection mechanism has been added along with predefined language profiles for 18 languages. (<a class="externalLink" href="https://issues.apache.or
 g/jira/browse/TIKA-209">TIKA-209</a>)</li><li>The media type registry in Tika was synchronized with the MIME type configuration in the Apache HTTP Server. Tika now knows about 1274 different media types and can detect 672 of those using 927 file extension and 280 magic byte patterns. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-285">TIKA-285</a>)</li><li>Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing PDF documents. This version is notably better than the 0.7.3 release used earlier. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-158">TIKA-158</a>)</li></ul><p>The following people have contributed to Tika 0.5 by submitting or commenting on the issues resolved in this release:</p><ul><li>Alex Baranov</li><li>Bart Hanssens</li><li>Benson Margulies</li><li>Chris A. Mattmann</li><li>Daan de Wit</li><li>Erik Hetzner</li><li>Frank Hellwig</li><li>Jeff Cadow</li><li>Joachim Zittmayr</li><li>Jukka Zitting </
 li><li>Julien Nioche</li><li>Ken Krugler</li><li>Maxim Valyanskiy</li><li>MRIT64</li><li>Paul Borgermans</li><li>Piotr B.</li><li>Robert Newson</li><li>Sascha Szott</li><li>Ted Dunning</li><li>Thilo Goetz</li><li>Uwe Schindler</li><li>Yuan-Fang Li</li></ul><p>See <a class="externalLink" href="http://tinyurl.com/yl9prwp">http://tinyurl.com/yl9prwp</a> for more details on these contributions.</p></div>
       </div>

Modified: tika/site/publish/0.6/formats.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.6/formats.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.6/formats.html (original)
+++ tika/site/publish/0.6/formats.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Supported Document Formats<a name="Supported_Document_Formats"></a></h2><p>This page lists al
 l the document formats supported by Apache Tika 0.6. Follow the links to the various parser class javadocs for more detailed information about each document format and how it is parsed by Tika.</p><ul><li><a href="#Supported_Document_Formats">Supported Document Formats</a><ul><li><a href="#HyperText_Markup_Language">HyperText Markup Language</a></li><li><a href="#XML_and_derived_formats">XML and derived formats</a></li><li><a href="#Microsoft_Office_document_formats">Microsoft Office document formats</a></li><li><a href="#OpenDocument_Format">OpenDocument Format</a></li><li><a href="#Portable_Document_Format">Portable Document Format</a></li><li><a href="#Electronic_Publication_Format">Electronic Publication Format</a></li><li><a href="#Rich_Text_Format">Rich Text Format</a></li><li><a href="#Compression_and_packaging_formats">Compression and packaging formats</a></li><li><a href="#Text_formats">Text formats</a></li><li><a href="#Audio_formats">Audio formats</a></li><li><a h
 ref="#Image_formats">Image formats</a></li><li><a href="#Video_formats">Video formats</a></li><li><a href="#Java_class_files_and_archives">Java class files and archives</a></li><li><a href="#The_mbox_format">The mbox format</a></li></ul></li></ul><div class="section"><h3><a name="HyperText_Markup_Language">HyperText Markup Language</a><a name="HyperText_Markup_Language"></a></h3><p>The HyperText Markup Language (HTML) is the lingua franca of the web. Tika uses the <a class="externalLink" href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> library to support virtually any kind of HTML found on the web. The output from the <a href="./api/org/apache/tika/parser/html/HtmlParser.html">HtmlParser</a> class is guaranteed to be well-formed and valid XHTML, and various heuristics are used to prevent things like inline scripts from cluttering the extracted text content.</p></div><div class="section"><h3><a name="XML_and_derived_formats">XML and derived formats</a><a name="XML_
 and_derived_formats"></a></h3><p>The Extensible Markup Language (XML) format is a generic format that can be used for all kinds of content. Tika has custom parsers for some widely used XML vocabularies like XHTML, OOXML and ODF, but the default <a href="./api/org/apache/tika/parser/xml/DcXMLParser.html">DcXMLParser</a> class simply extracts the text content of the document and ignores any XML structure. The only exceptions to this rule are Dublin Core metadata elements, which are used for the document metadata.</p></div><div class="section"><h3><a name="Microsoft_Office_document_formats">Microsoft Office document formats</a><a name="Microsoft_Office_document_formats"></a></h3><p>Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 introduced the new XML-based OOXML format. The <
 a href="./api/org/apache/tika/parser/microsoft/OfficeParser.html">OfficeParser</a> and <a href="./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html">OOXMLParser</a> classes use <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> libraries to support text and metadata extraction from both OLE2 and OOXML documents.</p></div><div class="section"><h3><a name="OpenDocument_Format">OpenDocument Format</a><a name="OpenDocument_Format"></a></h3><p>The OpenDocument format (ODF) is used most notably as the default format of the OpenOffice.org office suite. The <a href="./api/org/apache/tika/parser/odf/OpenDocumentParser.html">OpenDocumentParser</a> class supports this format and the earlier OpenOffice 1.0 format on which ODF is based.</p></div><div class="section"><h3><a name="Portable_Document_Format">Portable Document Format</a><a name="Portable_Document_Format"></a></h3><p>The <a href="./api/org/apache/tika/parser/pdf/PDFParser.html">PDFParser</a> class p
 arses Portable Document Format (PDF) documents using the <a class="externalLink" href="http://pdfbox.apache.org/">Apache PDFBox</a> library.</p></div><div class="section"><h3><a name="Electronic_Publication_Format">Electronic Publication Format</a><a name="Electronic_Publication_Format"></a></h3><p>The <a href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> class supports the Electronic Publication Format (EPUB) used for many digital books.</p></div><div class="section"><h3><a name="Rich_Text_Format">Rich Text Format</a><a name="Rich_Text_Format"></a></h3><p>The <a href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> class uses the standard javax.swing.text.rtf feature to extract text content from Rich Text Format (RTF) documents.</p></div><div class="section"><h3><a name="Compression_and_packaging_formats">Compression and packaging formats</a><a name="Compression_and_packaging_formats"></a></h3><p>Tika uses the <a class="externalLink" hre
 f="http://commons.apache.org/compress/">Commons Compress</a> library to support various compression and packaging formats. The <a href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> class and its subclasses first parse the top level compression or packaging format and then pass the unpacked document streams to a second parsing stage using the parser instance specified in the parse context.</p></div><div class="section"><h3><a name="Text_formats">Text formats</a><a name="Text_formats"></a></h3><p>Extracting text content from plain text files seems like a simple task until you start thinking of all the possible character encodings. The <a href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> class uses encoding detection code from the <a class="externalLink" href="http://site.icu-project.org/">ICU</a> project to automatically detect the character encoding of a text document.</p></div><div class="section"><h3><a name="Audio_formats">Audio
  formats</a><a name="Audio_formats"></a></h3><p>Tika can detect several common audio formats and extract metadata from them. Even text extraction is supported for some audio files that contain lyrics or other textual content. The <a href="./api/org/apache/tika/parser/audio/AudioParser.html">AudioParser</a> and <a href="./api/org/apache/tika/parser/audio/MidiParser.html">MidiParser</a> classes use standard javax.sound features to process simple audio formats, and the <a href="./api/org/apache/tika/parser/mp3/Mp3Parser.html">Mp3Parser</a> class adds support for the widely used MP3 format.</p></div><div class="section"><h3><a name="Image_formats">Image formats</a><a name="Image_formats"></a></h3><p>The <a href="./api/org/apache/tika/parser/image/ImageParser.html">ImageParser</a> class uses the standard javax.imageio feature to extract simple metadata from image formats supported by the Java platform. More complex image metadata is available through the <a href="./api/org/apache
 /tika/parser/jpeg/JpegParser.html">JpegParser</a> class that uses the metadata-extractor library to support Exif metadata extraction from JPEG images.</p></div><div class="section"><h3><a name="Video_formats">Video formats</a><a name="Video_formats"></a></h3><p>Currently Tika only supports the Flash video format using a simple parsing algorithm implemented in the <a href="./api/org/apache/tika/parser/flv/FLVParser">FLVParser</a> class.</p></div><div class="section"><h3><a name="Java_class_files_and_archives">Java class files and archives</a><a name="Java_class_files_and_archives"></a></h3><p>The <a href="./api/org/apache/tika/parser/asm/ClassParser">ClassParser</a> class extracts class names and method signatures from Java class files, and the <a href="./api/org/apache/tika/parser/pkg/ZipParser.html">ZipParser</a> class also supports jar archives.</p></div><div class="section"><h3><a name="The_mbox_format">The mbox format</a><a name="The_mbox_format"></a></h3><p>The <a href
 ="./api/org/apache/tika/parser/mbox/MboxParser.html">MboxParser</a> can extract email messages from the mbox format used by many email archives and Unix-style mailboxes.</p></div></div>
       </div>

Modified: tika/site/publish/0.6/gettingstarted.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.6/gettingstarted.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.6/gettingstarted.html (original)
+++ tika/site/publish/0.6/gettingstarted.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Getting Started with Apache Tika<a name="Getting_Started_with_Apache_Tika"></a></h2><p>This d
 ocument describes how to build Apache Tika from sources and how to start using Tika in an application.</p></div><div class="section"><h2>Getting and building the sources<a name="Getting_and_building_the_sources"></a></h2><p>To build Tika from sources you first need to either <a href="../download.html">download</a> a source release or <a href="../source-repository.html">check out</a> the latest sources from version control.</p><p>Once you have the sources, you can build them using the <a class="externalLink" href="http://maven.apache.org/">Maven 2</a> build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.</p><div><pre>mvn install</pre></div><p>See the Maven documentation for more information about the available build options.</p><p>Note that you need Java 5 or higher to build Tika.</p></div><div class="section"><h2>Build artifacts<a name="Build_artifacts"></a></h2><p>The Tik
 a 0.6 build consists of a number of components and produces the following main binaries:</p><dl><dt>tika-core/target/tika-core-0.6.jar</dt><dd> Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.</dd><dt>tika-parsers/target/tika-parsers-0.6.jar</dt><dd> Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.</dd><dt>tika-app/target/tika-app-0.6.jar</dt><dd> Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.</dd><dt>tika-bundle/target/tika-bundle-0.6.jar</dt><dd> Tika bundle. An OSGi bundle that includes everything you need to use all Tika functionality in an OSGi environment.</dd></dl></div><div class="section"><h2>Using Tika as a Maven dependency<a name="Using_Tika_as_a_Maven_dependency"></a></h2><p>The core library, tika-core, co
 ntains the key interfaces and classes of Tika and can be used by itself if you don't need the full set of parsers from the tika-parsers component. The tika-core dependency looks like this:</p><div><pre>  &lt;dependency&gt;
     &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;

Modified: tika/site/publish/0.6/index.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.6/index.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.6/index.html (original)
+++ tika/site/publish/0.6/index.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Apache Tika 0.6<a name="Apache_Tika_0.6"></a></h2><p>The most notable changes in Tika 0.6 ove
 r the previous release are:</p><ul><li>Mime-type detection for HTML (and all types) has been improved: malformed HTML files, and HTML files that require a bit more observed content before the type is properly detected, are now correctly identified by the AutoDetectParser. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-327">TIKA-327</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-357">TIKA-357</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-366">TIKA-366</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-367">TIKA-367</a>)</li><li>Tika now has an additional OSGi bundle packaging that includes all the required parser libraries. This bundle package makes it easy to use all Tika features in an OSGi environment. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-340">TIKA-340</a>, <a class="externalLink" href="https://issues.apache
 .org/jira/browse/TIKA-342">TIKA-342</a>)</li><li>The Apache POI dependency used for parsing Microsoft Office file formats has been upgraded to version 3.6. The most visible improvement in this version is the notably reduced ooxml jar file size. The tika-app jar size is now down to 15MB from the 25MB in Tika 0.5. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-353">TIKA-353</a>)</li><li>Handling of character encoding information in input metadata and HTML &lt;meta&gt; tags has been improved. When no applicable encoding information is available, the encoding is detected by looking at the input data. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-332">TIKA-332</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-334">TIKA-334</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-335">TIKA-335</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-341">TIK
 A-341</a>) </li><li>Some document types like Excel spreadsheets contain content like numbers or formulas whose exact text format depends on the current locale. So far Tika has used the platform default locale in such cases, but clients can now explicitly specify the locale by passing a Locale instance in the parse context. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-125">TIKA-125</a>)</li><li>The default text output encoding of the tika-app jar is now UTF-8 when running on Mac OS X. This is because the default encoding used by Java is not compatible with the console application in Mac OS X. On all other platforms the text output from tika-app still uses the platform default encoding. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-324">TIKA-324</a>)</li><li>A flash video (video/x-flv) parser has been added. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-328">TIKA-328</a>)</li><li>The handling 
 of Number and Date cell formatting within Microsoft Excel documents has been added. This includes currencies, percentages and scientific formats. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-103">TIKA-103</a>)</li></ul><p>The following people have contributed to Tika 0.6 by submitting or commenting on the issues resolved in this release:</p><ul><li>Andrzej Bialecki</li><li>Bertrand Delacretaz</li><li>Chris A. Mattmann</li><li>Dave Meikle</li><li>Erik Hetzner</li><li>Felix Meschberger</li><li>Jukka Zitting</li><li>Julien Nioche</li><li>Ken Krugler </li><li>Luke Nezda</li><li>Maxim Valyanskiy</li><li>Niall Pemberton</li><li>Peter Wolanin </li><li>Piotr B.</li><li>Sami Siren</li><li>Yuan-Fang Li</li></ul><p>See <a class="externalLink" href="http://tinyurl.com/yc3dk67">http://tinyurl.com/yc3dk67</a> for more details on these contributions.</p></div>
       </div>

Modified: tika/site/publish/0.6/parser.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.6/parser.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.6/parser.html (original)
+++ tika/site/publish/0.6/parser.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>The Parser interface<a name="The_Parser_interface"></a></h2><p>The <a href="./api/org/apache/
 tika/parser/Parser.html">org.apache.tika.parser.Parser</a> interface is the key concept of Apache Tika. It hides the complexity of different file formats and parsing libraries while providing a simple and powerful mechanism for client applications to extract structured text content and metadata from all sorts of documents. All this is achieved with a single method:</p><div><pre>void parse(
     InputStream stream, ContentHandler handler, Metadata metadata,

Modified: tika/site/publish/0.7/detection.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.7/detection.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.7/detection.html (original)
+++ tika/site/publish/0.7/detection.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Content Detection<a name="Content_Detection"></a></h2><p>This page gives you information on h
 ow content and language detection work with Apache Tika, and how to tune the behaviour of Tika.</p><ul><li><a href="#Content_Detection">Content Detection</a><ul><li><a href="#The_Detector_Interface">The Detector Interface</a></li><li><a href="#Mime_Magic_Detction">Mime Magic Detection</a></li><li><a href="#Resource_Name_Based_Detection">Resource Name Based Detection</a></li><li><a href="#Known_Content_Type_Detection">Known Content Type Detection</a></li><li><a href="#The_default_Mime_Types_Detector">The default Mime Types Detector</a></li><li><a href="#Container_Aware_Detection">Container Aware Detection</a></li><li><a href="#Language_Detection">Language Detection</a></li></ul></li></ul><div class="section"><h3><a name="The_Detector_Interface">The Detector Interface</a><a name="The_Detector_Interface"></a></h3><p>The <a href="./api/org/apache/tika/detect/Detector.html">org.apache.tika.detect.Detector</a> interface is the basis for most of the content type detection in 
 Apache Tika. All the different ways of detecting content implement the same common method:</p><div><pre>MediaType detect(java.io.InputStream input,
                  Metadata metadata) throws java.io.IOException</pre></div><p>The <tt>detect</tt> method takes the stream to inspect, and a <tt>Metadata</tt> object that holds any additional information on the content. The detector will return a <a href="./api/org/apache/tika/mime/MediaType.html">MediaType</a> object describing its best guess as to the type of the file.</p><p>In general, only two keys on the Metadata object are used by Detectors. These are <tt>Metadata.RESOURCE_NAME_KEY</tt>, which should hold the name of the file (where known), and <tt>Metadata.CONTENT_TYPE</tt>, which should hold the advertised content type of the file (e.g. from a web server or a content repository).</p></div><div class="section"><h3><a name="Mime_Magic_Detction">Mime Magic Detection</a><a name="Mime_Magic_Detction"></a></h3><p>By looking for special (&quot;magic&quot;) patterns of bytes near the start of the file, it is often possible to detect the type of the file. For some file types, this is
  a simple process. For others, typically container based formats, the magic detection may not be enough. (More detail on detecting container formats is given below.)</p><p>Tika is able to make use of a mime magic info file, in the <a class="externalLink" href="http://www.freedesktop.org/standards/shared-mime-info">Freedesktop MIME-info</a> format, to perform mime magic detection.</p><p>This is provided within Tika by <a href="./api/org/apache/tika/detect/MagicDetector.html">org.apache.tika.detect.MagicDetector</a>. It is most commonly accessed via <a href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>, normally sourced from the <tt>tika-mimetypes.xml</tt> file.</p></div><div class="section"><h3><a name="Resource_Name_Based_Detection">Resource Name Based Detection</a><a name="Resource_Name_Based_Detection"></a></h3><p>Where the name of the file is known, it is sometimes possible to guess the file type from the name or extension. Within the <tt>tika-mimetyp
 es.xml</tt> file is a list of patterns which are used to identify the type from the filename.</p><p>However, because files may be renamed, this method of detection is quick but not always accurate.</p><p>This is provided within Tika by <a href="./api/org/apache/tika/detect/NameDetector.html">org.apache.tika.detect.NameDetector</a>.</p></div><div class="section"><h3><a name="Known_Content_Type_Detection">Known Content Type Detection</a><a name="Known_Content_Type_Detection"></a></h3><p>Sometimes, the mime type for a file is already known, such as when downloading from a web server, or when retrieving from a content store. This information can be used by detectors, such as <a href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>.</p></div><div class="section"><h3><a name="The_default_Mime_Types_Detector">The default Mime Types Detector</a><a name="The_default_Mime_Types_Detector"></a></h3><p>By default, the mime type detection in Tika is p
 rovided by <a href="./api/org/apache/tika/mime/MimeTypes.html">org.apache.tika.mime.MimeTypes</a>. This detector makes use of <tt>tika-mimetypes.xml</tt> to power magic based and filename based detection.</p><p>Firstly, magic based detection is used on the start of the file. If the file is an XML file, then the start of the XML is processed to look for root elements. Next, if available, the filename (from <tt>Metadata.RESOURCE_NAME_KEY</tt>) is then used to improve the detail of the detection, such as when magic detects a text file, and the filename hints it's really a CSV. Finally, if available, the supplied content type (from <tt>Metadata.CONTENT_TYPE</tt>) is used to further refine the type.</p></div><div class="section"><h3><a name="Container_Aware_Detection">Container Aware Detection</a><a name="Container_Aware_Detection"></a></h3><p>Several common file formats are actually held within a common container format. One example is the PowerPoint .ppt and Word .doc formats, 
 which are both held within an OLE2 container. Another is Apple iWork formats, which are actually a series of XML files within a Zip file.</p><p>Using magic detection, it is easy to spot that a given file is an OLE2 document, or a Zip file. Using magic detection alone, it is very difficult (and often impossible) to tell what kind of file lives inside the container.</p><p>For some use cases, speed is important, so having a quick way to know the container type is sufficient. For other cases however, you don't mind spending a bit of time (and memory!) processing the container to get a more accurate answer on its contents. For these cases, a container aware detector should be used.</p><p>Tika provides a wrapping detector in the parsers bundle, of <a href="./api/org/apache/tika/detect/ContainerAwareDetector.html">org.apache.tika.detect.ContainerAwareDetector</a>. This detector will check for certain known containers, and if found, will open them and detect the appropriate type bas
 ed on the contents. If the file isn't a known container, it will fall back to another detector for the answer (most commonly the default <tt>MimeTypes</tt> detector).</p><p>Because this detector needs to read the whole file to process the container, it must be used with a <a href="./api/org/apache/tika/io/TikaInputStream.html">org.apache.tika.io.TikaInputStream</a>. If called with a regular <tt>InputStream</tt>, then all work will be done by the fallback detector.</p><p>For more information on container formats and Tika, see <a class="externalLink" href="http://wiki.apache.org/tika/MetadataDiscussion">http://wiki.apache.org/tika/MetadataDiscussion</a>.</p></div><div class="section"><h3><a name="Language_Detection">Language Detection</a><a name="Language_Detection"></a></h3><p>Tika is able to help identify the language of a piece of text, which is useful when extracting text from document formats which do not include language information in their metadata.</p><p>The language detection is provided by <a href="./api/org/apac
 he/tika/language/LanguageIdentifier.html">org.apache.tika.language.LanguageIdentifier</a></p></div></div>
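
The detection order described above (magic bytes first, then the file name to refine the result) can be sketched in plain Java. This is only an illustrative sketch, not Tika's implementation: the class name LayeredDetectionSketch, its methods, the handful of magic patterns and the returned type strings are all assumptions made for the example, and the real detectors are driven by tika-mimetypes.xml rather than hard-coded rules.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch of layered type detection; not part of Tika.
public class LayeredDetectionSketch {

    // Returns true when data begins with the given magic byte values.
    static boolean startsWith(byte[] data, int... magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns true when every byte is printable ASCII or common whitespace.
    static boolean looksLikeText(byte[] data) {
        if (data.length == 0) {
            return false;
        }
        for (byte b : data) {
            boolean printable = b >= 0x20 && b <= 0x7E;
            boolean whitespace = b == '\n' || b == '\r' || b == '\t';
            if (!printable && !whitespace) {
                return false;
            }
        }
        return true;
    }

    // Step 1: guess a type from the leading bytes (illustrative patterns only).
    static String detectByMagic(byte[] head) {
        if (startsWith(head, 0x50, 0x4B, 0x03, 0x04)) {
            return "application/zip";          // "PK" zip local file header
        }
        if (startsWith(head, 0x25, 0x50, 0x44, 0x46)) {
            return "application/pdf";          // "%PDF"
        }
        if (looksLikeText(head)) {
            return "text/plain";
        }
        return "application/octet-stream";
    }

    // Step 2: use the file name, when known, to refine a generic answer.
    static String refineByName(String type, String name) {
        if (name == null) {
            return type;
        }
        String lower = name.toLowerCase(Locale.ROOT);
        if (type.equals("text/plain") && lower.endsWith(".csv")) {
            return "text/csv";
        }
        if (type.equals("application/octet-stream") && lower.endsWith(".pdf")) {
            return "application/pdf";
        }
        return type;
    }

    public static String detect(byte[] head, String name) {
        return refineByName(detectByMagic(head), name);
    }

    public static void main(String[] args) {
        byte[] csv = "a,b,c\n1,2,3\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(csv, "report.csv"));  // text/csv
        System.out.println(detect(csv, null));          // text/plain
        System.out.println(detect(new byte[] {0x50, 0x4B, 0x03, 0x04}, "slides.key"));  // application/zip
    }
}
```

The last call shows the limitation that motivates container aware detection: magic bytes alone only reveal that the file is a Zip container, not what kind of document it holds.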

Modified: tika/site/publish/0.7/formats.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.7/formats.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.7/formats.html (original)
+++ tika/site/publish/0.7/formats.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Supported Document Formats<a name="Supported_Document_Formats"></a></h2><p>This page lists al
 l the document formats supported by Apache Tika 0.7. Follow the links to the various parser class javadocs for more detailed information about each document format and how it is parsed by Tika.</p><ul><li><a href="#Supported_Document_Formats">Supported Document Formats</a><ul><li><a href="#HyperText_Markup_Language">HyperText Markup Language</a></li><li><a href="#XML_and_derived_formats">XML and derived formats</a></li><li><a href="#Microsoft_Office_document_formats">Microsoft Office document formats</a></li><li><a href="#OpenDocument_Format">OpenDocument Format</a></li><li><a href="#Portable_Document_Format">Portable Document Format</a></li><li><a href="#Electronic_Publication_Format">Electronic Publication Format</a></li><li><a href="#Rich_Text_Format">Rich Text Format</a></li><li><a href="#Compression_and_packaging_formats">Compression and packaging formats</a></li><li><a href="#Text_formats">Text formats</a></li><li><a href="#Audio_formats">Audio formats</a></li><li><a h
 ref="#Image_formats">Image formats</a></li><li><a href="#Video_formats">Video formats</a></li><li><a href="#Java_class_files_and_archives">Java class files and archives</a></li><li><a href="#The_mbox_format">The mbox format</a></li></ul></li></ul><div class="section"><h3><a name="HyperText_Markup_Language">HyperText Markup Language</a><a name="HyperText_Markup_Language"></a></h3><p>The HyperText Markup Language (HTML) is the lingua franca of the web. Tika uses the <a class="externalLink" href="http://home.ccil.org/~cowan/XML/tagsoup/">TagSoup</a> library to support virtually any kind of HTML found on the web. The output from the <a href="./api/org/apache/tika/parser/html/HtmlParser.html">HtmlParser</a> class is guaranteed to be well-formed and valid XHTML, and various heuristics are used to prevent things like inline scripts from cluttering the extracted text content.</p></div><div class="section"><h3><a name="XML_and_derived_formats">XML and derived formats</a><a name="XML_
 and_derived_formats"></a></h3><p>The Extensible Markup Language (XML) format is a generic format that can be used for all kinds of content. Tika has custom parsers for some widely used XML vocabularies like XHTML, OOXML and ODF, but the default <a href="./api/org/apache/tika/parser/xml/DcXMLParser.html">DcXMLParser</a> class simply extracts the text content of the document and ignores any XML structure. The only exceptions to this rule are Dublin Core metadata elements, which are used for the document metadata.</p></div><div class="section"><h3><a name="Microsoft_Office_document_formats">Microsoft Office document formats</a><a name="Microsoft_Office_document_formats"></a></h3><p>Microsoft Office and some related applications produce documents in the generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The older OLE 2 format was introduced in Microsoft Office version 97 and was the default format until Office version 2007 introduced the new XML-based OOXML format. The <
 a href="./api/org/apache/tika/parser/microsoft/OfficeParser.html">OfficeParser</a> and <a href="./api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html">OOXMLParser</a> classes use <a class="externalLink" href="http://poi.apache.org/">Apache POI</a> libraries to support text and metadata extraction from both OLE2 and OOXML documents.</p></div><div class="section"><h3><a name="OpenDocument_Format">OpenDocument Format</a><a name="OpenDocument_Format"></a></h3><p>The OpenDocument format (ODF) is used most notably as the default format of the OpenOffice.org office suite. The <a href="./api/org/apache/tika/parser/odf/OpenDocumentParser.html">OpenDocumentParser</a> class supports this format and the earlier OpenOffice 1.0 format on which ODF is based.</p></div><div class="section"><h3><a name="Portable_Document_Format">Portable Document Format</a><a name="Portable_Document_Format"></a></h3><p>The <a href="./api/org/apache/tika/parser/pdf/PDFParser.html">PDFParser</a> class p
 arses Portable Document Format (PDF) documents using the <a class="externalLink" href="http://pdfbox.apache.org/">Apache PDFBox</a> library.</p></div><div class="section"><h3><a name="Electronic_Publication_Format">Electronic Publication Format</a><a name="Electronic_Publication_Format"></a></h3><p>The <a href="./api/org/apache/tika/parser/epub/EpubParser.html">EpubParser</a> class supports the Electronic Publication Format (EPUB) used for many digital books.</p></div><div class="section"><h3><a name="Rich_Text_Format">Rich Text Format</a><a name="Rich_Text_Format"></a></h3><p>The <a href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> class uses the standard javax.swing.text.rtf feature to extract text content from Rich Text Format (RTF) documents.</p></div><div class="section"><h3><a name="Compression_and_packaging_formats">Compression and packaging formats</a><a name="Compression_and_packaging_formats"></a></h3><p>Tika uses the <a class="externalLink" hre
 f="http://commons.apache.org/compress/">Commons Compress</a> library to support various compression and packaging formats. The <a href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> class and its subclasses first parse the top level compression or packaging format and then pass the unpacked document streams to a second parsing stage using the parser instance specified in the parse context.</p></div><div class="section"><h3><a name="Text_formats">Text formats</a><a name="Text_formats"></a></h3><p>Extracting text content from plain text files seems like a simple task until you start thinking of all the possible character encodings. The <a href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> class uses encoding detection code from the <a class="externalLink" href="http://site.icu-project.org/">ICU</a> project to automatically detect the character encoding of a text document.</p></div><div class="section"><h3><a name="Audio_formats">Audio
  formats</a><a name="Audio_formats"></a></h3><p>Tika can detect several common audio formats and extract metadata from them. Even text extraction is supported for some audio files that contain lyrics or other textual content. The <a href="./api/org/apache/tika/parser/audio/AudioParser.html">AudioParser</a> and <a href="./api/org/apache/tika/parser/audio/MidiParser.html">MidiParser</a> classes use standard javax.sound features to process simple audio formats, and the <a href="./api/org/apache/tika/parser/mp3/Mp3Parser.html">Mp3Parser</a> class adds support for the widely used MP3 format.</p></div><div class="section"><h3><a name="Image_formats">Image formats</a><a name="Image_formats"></a></h3><p>The <a href="./api/org/apache/tika/parser/image/ImageParser.html">ImageParser</a> class uses the standard javax.imageio feature to extract simple metadata from image formats supported by the Java platform. More complex image metadata is available through the <a href="./api/org/apache
 /tika/parser/jpeg/JpegParser.html">JpegParser</a> class that uses the metadata-extractor library to support Exif metadata extraction from JPEG images.</p></div><div class="section"><h3><a name="Video_formats">Video formats</a><a name="Video_formats"></a></h3><p>Currently Tika only supports the Flash video format using a simple parsing algorithm implemented in the <a href="./api/org/apache/tika/parser/flv/FLVParser">FLVParser</a> class.</p></div><div class="section"><h3><a name="Java_class_files_and_archives">Java class files and archives</a><a name="Java_class_files_and_archives"></a></h3><p>The <a href="./api/org/apache/tika/parser/asm/ClassParser">ClassParser</a> class extracts class names and method signatures from Java class files, and the <a href="./api/org/apache/tika/parser/pkg/ZipParser.html">ZipParser</a> class also supports JAR archives.</p></div><div class="section"><h3><a name="The_mbox_format">The mbox format</a><a name="The_mbox_format"></a></h3><p>The <a href
 ="./api/org/apache/tika/parser/mbox/MboxParser.html">MboxParser</a> can extract email messages from the mbox format used by many email archives and Unix-style mailboxes.</p></div></div>
       </div>

Modified: tika/site/publish/0.7/gettingstarted.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.7/gettingstarted.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.7/gettingstarted.html (original)
+++ tika/site/publish/0.7/gettingstarted.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Getting Started with Apache Tika<a name="Getting_Started_with_Apache_Tika"></a></h2><p>This d
 ocument describes how to build Apache Tika from sources and how to start using Tika in an application.</p></div><div class="section"><h2>Getting and building the sources<a name="Getting_and_building_the_sources"></a></h2><p>To build Tika from sources you first need to either <a href="../download.html">download</a> a source release or <a href="../source-repository.html">check out</a> the latest sources from version control.</p><p>Once you have the sources, you can build them using the <a class="externalLink" href="http://maven.apache.org/">Maven 2</a> build system. Executing the following command in the base directory will build the sources and install the resulting artifacts in your local Maven repository.</p><div><pre>mvn install</pre></div><p>See the Maven documentation for more information about the available build options.</p><p>Note that you need Java 5 or higher to build Tika.</p></div><div class="section"><h2>Build artifacts<a name="Build_artifacts"></a></h2><p>The Tik
 a 0.7 build consists of a number of components and produces the following main binaries:</p><dl><dt>tika-core/target/tika-core-0.7.jar</dt><dd> Tika core library. Contains the core interfaces and classes of Tika, but none of the parser implementations. Depends only on Java 5.</dd><dt>tika-parsers/target/tika-parsers-0.7.jar</dt><dd> Tika parsers. Collection of classes that implement the Tika Parser interface based on various external parser libraries.</dd><dt>tika-app/target/tika-app-0.7.jar</dt><dd> Tika application. Combines the above libraries and all the external parser libraries into a single runnable jar with a GUI and a command line interface.</dd><dt>tika-bundle/target/tika-bundle-0.7.jar</dt><dd> Tika bundle. An OSGi bundle that includes everything you need to use all Tika functionality in an OSGi environment.</dd></dl></div><div class="section"><h2>Using Tika as a Maven dependency<a name="Using_Tika_as_a_Maven_dependency"></a></h2><p>The core library, tika-core, co
 ntains the key interfaces and classes of Tika and can be used by itself if you don't need the full set of parsers from the tika-parsers component. The tika-core dependency looks like this:</p><div><pre>  &lt;dependency&gt;
     &lt;groupId&gt;org.apache.tika&lt;/groupId&gt;

Modified: tika/site/publish/0.7/index.html
URL: http://svn.apache.org/viewvc/tika/site/publish/0.7/index.html?rev=1029517&r1=1029516&r2=1029517&view=diff
==============================================================================
--- tika/site/publish/0.7/index.html (original)
+++ tika/site/publish/0.7/index.html Sun Oct 31 23:33:48 2010
@@ -26,7 +26,6 @@
 
 
 
-
 <html xmlns="http://www.w3.org/1999/xhtml">
   <head>
     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
@@ -69,23 +68,21 @@
             document.forms['searchform'].elements['searchProvider'].value = provider;
           }
         }
+        document.forms['searchform'].elements['q'].focus();
       }
     </script>
   </head>
   <body onLoad="initProvider();">
     <div id="body">
       <div id="banner">
-                    <a href="" id="bannerLeft"  title="Apache Tika"  >
-    
-                                            <img src="../tika.png" alt="Apache Tika" />
-    
-            </a>
-                          <a href="http://www.apache.org/" id="bannerRight"  title="The Apache Software Foundation"  >
-    
-                                            <img src="../asf-logo.gif" alt="The Apache Software Foundation" />
-    
-            </a>
-            </div>
+        <a href="http://tika.apache.org" id="bannerLeft" title="Apache Tika"
+          ><img src="tika.png" alt="Apache Tika"
+                width="292" height="100"/></a>
+        <a href="http://www.apache.org/" id="bannerRight"
+           title="The Apache Software Foundation"
+          ><img src="asf-logo.gif" alt="The Apache Software Foundation"
+                width="387" height="100"/></a>
+      </div>
       <div id="content">
         <!-- Licensed to the Apache Software Foundation (ASF) under one or more --><!-- contributor license agreements.  See the NOTICE file distributed with --><!-- this work for additional information regarding copyright ownership. --><!-- The ASF licenses this file to You under the Apache License, Version 2.0 --><!-- (the "License"); you may not use this file except in compliance with --><!-- the License.  You may obtain a copy of the License at --><!--  --><!-- http://www.apache.org/licenses/LICENSE-2.0 --><!--  --><!-- Unless required by applicable law or agreed to in writing, software --><!-- distributed under the License is distributed on an "AS IS" BASIS, --><!-- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. --><!-- See the License for the specific language governing permissions and --><!-- limitations under the License. --><div class="section"><h2>Apache Tika 0.7<a name="Apache_Tika_0.7"></a></h2><p>The most notable changes in Tika 0.7 ove
 r the previous release are:</p><ul><li>MP3 file parsing was improved, including Channel and SampleRate extraction and ID3v2 support (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-368">TIKA-368</a>, <a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-372">TIKA-372</a>). Mime detection for the MIDI audio format was also improved. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-199">TIKA-199</a>)</li><li>Tika no longer relies on X11 for its RTF parsing functionality. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-386">TIKA-386</a>)</li><li>A thread-safety bug in the AutoDetectParser was discovered and addressed. (<a class="externalLink" href="https://issues.apache.org/jira/browse/TIKA-374">TIKA-374</a>)</li><li>Upgrade to PDFBox 1.0.0. The new PDFBox version improves PDF parsing performance and fixes a number of text extraction issues. (<a class="externalLink" hre
 f="https://issues.apache.org/jira/browse/TIKA-380">TIKA-380</a>)</li></ul><p>The following people have contributed to Tika 0.7 by submitting or commenting on the issues resolved in this release:</p><ul><li>Adam Rauch </li><li>Benson Margulies </li><li>Brett S. </li><li>Chris A. Mattmann </li><li>Daan de Wit </li><li>Dave Meikle </li><li>Durville </li><li>Ingo Renner </li><li>Jukka Zitting </li><li>Ken Krugler </li><li>Kenny Neal </li><li>Markus Goldbach</li><li>Maxim Valyanskiy </li><li>Nick Burch </li><li>Sami Siren </li><li>Uwe Schindler </li></ul><p>See <a class="externalLink" href="http://tinyurl.com/yklopby">http://tinyurl.com/yklopby</a> for more details on these contributions.</p></div>
       </div>