Posted to commits@tika.apache.org by ma...@apache.org on 2012/03/24 06:30:15 UTC

svn commit: r1304707 - in /tika/site/src/site: apt/1.1/ apt/1.1/detection.apt apt/1.1/formats.apt apt/1.1/gettingstarted.apt apt/1.1/index.apt apt/1.1/parser.apt apt/1.1/parser_guide.apt apt/download.apt apt/index.apt site.xml

Author: mattmann
Date: Sat Mar 24 05:30:14 2012
New Revision: 1304707

URL: http://svn.apache.org/viewvc?rev=1304707&view=rev
Log:
- update Tika website for 1.1

Added:
    tika/site/src/site/apt/1.1/
    tika/site/src/site/apt/1.1/detection.apt
    tika/site/src/site/apt/1.1/formats.apt
    tika/site/src/site/apt/1.1/gettingstarted.apt
    tika/site/src/site/apt/1.1/index.apt
    tika/site/src/site/apt/1.1/parser.apt
    tika/site/src/site/apt/1.1/parser_guide.apt
Modified:
    tika/site/src/site/apt/download.apt
    tika/site/src/site/apt/index.apt
    tika/site/src/site/site.xml

Added: tika/site/src/site/apt/1.1/detection.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/detection.apt?rev=1304707&view=auto
==============================================================================
--- tika/site/src/site/apt/1.1/detection.apt (added)
+++ tika/site/src/site/apt/1.1/detection.apt Sat Mar 24 05:30:14 2012
@@ -0,0 +1,152 @@
+                          -----------------
+                          Content Detection
+                          -----------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Content Detection
+
+   This page gives you information on how content and language detection
+   works with Apache Tika, and how to tune the behaviour of Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {The Detector Interface}
+
+  The
+  {{{./api/org/apache/tika/detect/Detector.html}org.apache.tika.detect.Detector}}
+  interface is the basis for most of the content type detection in Apache
+  Tika. The different ways of detecting content all implement the
+  same common method:
+
+---
+MediaType detect(java.io.InputStream input,
+                 Metadata metadata) throws java.io.IOException
+---
+
+   The <<<detect>>> method takes the stream to inspect, and a 
+   <<<Metadata>>> object that holds any additional information on
+   the content. The detector will return a 
+   {{{./api/org/apache/tika/mime/MediaType.html}MediaType}} object describing
+   its best guess as to the type of the file.
+
+   In general, only two keys on the Metadata object are used by Detectors.
+   These are <<<Metadata.RESOURCE_NAME_KEY>>> which should hold the name
+   of the file (where known), and <<<Metadata.CONTENT_TYPE>>> which should
+   hold the advertised content type of the file (e.g. from a webserver or
+   a content repository).
+
+
+* {Mime Magic Detection}
+
+  By looking for special ("magic") patterns of bytes near the start of
+  the file, it is often possible to detect the type of the file. For
+  some file types, this is a simple process. For others, typically
+  container based formats, the magic detection may not be enough. (More
+  detail on detecting container formats below)
+
+  Tika is able to make use of a mime magic info file, in the 
+  {{{http://www.freedesktop.org/standards/shared-mime-info}Freedesktop MIME-info}} 
+  format to perform mime magic detection.
+
+  This is provided within Tika by
+  {{{./api/org/apache/tika/detect/MagicDetector.html}org.apache.tika.detect.MagicDetector}}. It is most commonly accessed via
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}},
+  normally sourced from the <<<tika-mimetypes.xml>>> file.
+   
+
+* {Resource Name Based Detection}
+
+  Where the name of the file is known, it is sometimes possible to guess 
+  the file type from the name or extension. Within the 
+  <<<tika-mimetypes.xml>>> file is a list of patterns which are used to
+  identify the type from the filename.
+
+  However, because files may be renamed, this method of detection is quick
+  but not always accurate.
+
+  This is provided within Tika by
+  {{{./api/org/apache/tika/detect/NameDetector.html}org.apache.tika.detect.NameDetector}}.
+
+
+* {Known Content Type "Detection"}
+
+  Sometimes, the mime type for a file is already known, such as when
+  downloading from a webserver, or when retrieving from a content store.
+  This information can be used by detectors, such as
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+
+
+* {The default Mime Types Detector}
+
+  By default, the mime type detection in Tika is provided by
+  {{{./api/org/apache/tika/mime/MimeTypes.html}org.apache.tika.mime.MimeTypes}}.
+  This detector makes use of <<<tika-mimetypes.xml>>> to power
+  magic based and filename based detection.
+
+  Firstly, magic based detection is used on the start of the file.
+  If the file is an XML file, then the start of the XML is processed
+  to look for root elements. Next, if available, the filename 
+  (from <<<Metadata.RESOURCE_NAME_KEY>>>) is
+  used to improve the detail of the detection, such as when magic
+  detects a text file, and the filename hints it's really a CSV. Finally,
+  if available, the supplied content type (from <<<Metadata.CONTENT_TYPE>>>)
+  is used to further refine the type.
+
+
+* {Container Aware Detection}
+
+  Several common file formats are actually held within a common container
+  format. For example, the PowerPoint .ppt and Word .doc formats
+  are both held within an OLE2 container. Likewise, the Apple iWork formats
+  are actually a series of XML files within a Zip file.
+
+  Using magic detection, it is easy to spot that a given file is an OLE2
+  document, or a Zip file. Using magic detection alone, it is very difficult
+  (and often impossible) to tell what kind of file lives inside the container.
+
+  For some use cases, speed is important, so having a quick way to know the
+  container type is sufficient. For other cases however, you don't mind 
+  spending a bit of time (and memory!) processing the container to get a 
+  more accurate answer on its contents. For these cases, a container
+  aware detector should be used.
+
+  Tika provides a wrapping detector in the parsers bundle, of
+  {{{./api/org/apache/tika/detect/ContainerAwareDetector.html}org.apache.tika.detect.ContainerAwareDetector}}.
+  This detector will check for certain known containers, and if found,
+  will open them and detect the appropriate type based on the contents.
+  If the file isn't a known container, it will fall back to another
+  detector for the answer (most commonly the default 
+  <<<MimeTypes>>> detector).
+
+  Because this detector needs to read the whole file to process the
+  container, it must be used with a 
+  {{{./api/org/apache/tika/io/TikaInputStream.html}org.apache.tika.io.TikaInputStream}}.
+  If called with a regular <<<InputStream>>>, then all work will be done
+  by the fallback detector.
+
+  For more information on container formats and Tika, see
+  {{{http://wiki.apache.org/tika/MetadataDiscussion}}}
+
+
+* {Language Detection}
+
+  Tika is able to help identify the language of a piece of text, which
+  is useful when extracting text from document formats which do not include
+  language information in their metadata.
+
+  The language detection is provided by
+  {{{./api/org/apache/tika/language/LanguageIdentifier.html}org.apache.tika.language.LanguageIdentifier}}
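
As a rough illustration of the default detection order described in detection.apt above (magic bytes first, then the file name as a refinement), the following plain-Java sketch may help. It does not use or reproduce Tika's MimeTypes code: the class name LayeredDetectionSketch, the two-entry magic table, and the NUL-byte text heuristic are assumptions made for this example only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class LayeredDetectionSketch {

    // Hypothetical magic table; a real registry holds many more patterns.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] ZIP_MAGIC = {'P', 'K', 3, 4};

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    /** Step 1: guess the type from the first bytes of the content. */
    static String detectByMagic(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return "application/zip";
        }
        // Assumed heuristic: content without NUL bytes is treated as text.
        for (byte b : head) {
            if (b == 0) {
                return "application/octet-stream";
            }
        }
        return "text/plain";
    }

    /** Step 2: let the file name add detail, but never override a specific magic match. */
    static String refineByName(String type, String resourceName) {
        if (resourceName == null) {
            return type;
        }
        String lower = resourceName.toLowerCase(Locale.ROOT);
        if (type.equals("text/plain") && lower.endsWith(".csv")) {
            return "text/csv";
        }
        return type;
    }

    static String detect(byte[] head, String resourceName) {
        return refineByName(detectByMagic(head), resourceName);
    }

    public static void main(String[] args) {
        byte[] csv = "a,b\n1,2\n".getBytes(StandardCharsets.US_ASCII);
        byte[] pdf = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(csv, "data.csv"));
        System.out.println(detect(pdf, "renamed.csv"));
    }
}
```

In this sketch the file name only refines a generic text result, so a PDF that has been renamed to end in .csv is still reported as a PDF. That mirrors the point made in the text that name-based detection is quick but can be misled by renamed files.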

Added: tika/site/src/site/apt/1.1/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/formats.apt?rev=1304707&view=auto
==============================================================================
--- tika/site/src/site/apt/1.1/formats.apt (added)
+++ tika/site/src/site/apt/1.1/formats.apt Sat Mar 24 05:30:14 2012
@@ -0,0 +1,145 @@
+                       --------------------------
+                       Supported Document Formats
+                       --------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Supported Document Formats
+
+   This page lists all the document formats supported by Apache Tika 1.1.
+   Follow the links to the various parser class javadocs for more detailed
+   information about each document format and how it is parsed by Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {HyperText Markup Language}
+
+   The HyperText Markup Language (HTML) is the lingua franca of the web.
+   Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
+   library to support virtually any kind of HTML found on the web.
+   The output from the
+   {{{api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
+   is guaranteed to be well-formed and valid XHTML, and various heuristics
+   are used to prevent things like inline scripts from cluttering the
+   extracted text content.
+
+* {XML and derived formats}
+
+   The Extensible Markup Language (XML) format is a generic format that can
+   be used for all kinds of content. Tika has custom parsers for some widely
+   used XML vocabularies like XHTML, OOXML and ODF, but the default
+   {{{api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
+   class simply extracts the text content of the document and ignores any XML
+   structure. The only exception to this rule are Dublin Core metadata
+   elements that are used for the document metadata.
+
+* {Microsoft Office document formats}
+
+   Microsoft Office and some related applications produce documents in the
+   generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
+   older OLE 2 format was introduced in Microsoft Office version 97 and was
+   the default format until Office version 2007 and the new XML-based
+   OOXML format. The
+   {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+   and
+   {{{api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
+   classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
+   text and metadata extraction from both OLE2 and OOXML documents.
+
+* {OpenDocument Format}
+
+   The OpenDocument format (ODF) is used most notably as the default format
+   of the OpenOffice.org office suite. The
+   {{{api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
+   class supports this format and the earlier OpenOffice 1.0 format on which
+   ODF is based.
+
+* {Portable Document Format}
+
+   The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
+   parses Portable Document Format (PDF) documents using the
+   {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
+
+* {Electronic Publication Format}
+
+   The {{{api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
+   supports the Electronic Publication Format (EPUB) used for many digital
+   books.
+
+* {Rich Text Format}
+
+   The {{{api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
+   uses the standard javax.swing.text.rtf feature to extract text content
+   from Rich Text Format (RTF) documents.
+
+* {Compression and packaging formats}
+
+   Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
+   library to support various compression and packaging formats. The
+   {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
+   class and its subclasses first parse the top level compression or
+   packaging format and then pass the unpacked document streams to a
+   second parsing stage using the parser instance specified in the
+   parse context.
+
+* {Text formats}
+
+   Extracting text content from plain text files seems like a simple task
+   until you start thinking of all the possible character encodings. The
+   {{{api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
+   encoding detection code from the {{{http://site.icu-project.org/}ICU}}
+   project to automatically detect the character encoding of a text document.
+
+* {Audio formats}
+
+   Tika can detect several common audio formats and extract metadata
+   from them. Even text extraction is supported for some audio files that
+   contain lyrics or other textual content. The
+   {{{api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
+   and {{{api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
+   classes use standard javax.sound features to process simple audio
+   formats, and the
+   {{{api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
+   adds support for the widely used MP3 format.
+
+* {Image formats}
+
+   The {{{api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
+   class uses the standard javax.imageio feature to extract simple metadata
+   from image formats supported by the Java platform. More complex image
+   metadata is available through the
+   {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class
+   that uses the metadata-extractor library to support Exif metadata
+   extraction from JPEG images.
+
+* {Video formats}
+
+   Currently Tika only supports the Flash video format using a simple
+   parsing algorithm implemented in the
+   {{{api/org/apache/tika/parser/flv/FLVParser.html}FLVParser}} class.
+
+* {Java class files and archives}
+
+   The {{{api/org/apache/tika/parser/asm/ClassParser.html}ClassParser}} class
+   extracts class names and method signatures from Java class files, and
+   the {{{api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
+   also supports jar archives.
+
+* {The mbox format}
+
+   The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
+   extract email messages from the mbox format used by many email archives
+   and Unix-style mailboxes.
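
The two-stage approach described under "Compression and packaging formats" in formats.apt above can be sketched in plain Java. This is only an illustration: it assumes java.util.zip in place of Commons Compress and a hypothetical EntryHandler callback in place of the parser taken from the parse context, and none of these names come from Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class TwoStageUnpackSketch {

    /** Hypothetical second stage: receives each unpacked entry stream. */
    interface EntryHandler {
        void handle(String name, InputStream content) throws IOException;
    }

    /** First stage: walk the package and pass every file entry on. */
    static void unpack(InputStream packaged, EntryHandler secondStage) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(packaged)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.isDirectory()) {
                    // The handler reads the current entry but must not close the stream.
                    secondStage.handle(entry.getName(), zip);
                }
            }
        }
    }

    /** Reads the rest of the current entry as UTF-8 text. */
    static String readText(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a small in-memory zip and runs both stages over it. */
    static List<String> demo() {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
                zip.putNextEntry(new ZipEntry("a.txt"));
                zip.write("hello".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
                zip.putNextEntry(new ZipEntry("b.txt"));
                zip.write("world".getBytes(StandardCharsets.UTF_8));
                zip.closeEntry();
            }
            List<String> seen = new ArrayList<>();
            unpack(new ByteArrayInputStream(bytes.toByteArray()),
                    (name, content) -> seen.add(name + "=" + readText(content)));
            return seen;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the second stage behind a callback means the unpacking code needs no knowledge of the formats inside the package, which is the separation the text attributes to PackageParser and the parser supplied through the parse context.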

Added: tika/site/src/site/apt/1.1/gettingstarted.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/gettingstarted.apt?rev=1304707&view=auto
==============================================================================
--- tika/site/src/site/apt/1.1/gettingstarted.apt (added)
+++ tika/site/src/site/apt/1.1/gettingstarted.apt Sat Mar 24 05:30:14 2012
@@ -0,0 +1,228 @@
+                     --------------------------------
+                     Getting Started with Apache Tika
+                     --------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Getting Started with Apache Tika
+
+ This document describes how to build Apache Tika from sources and
+ how to start using Tika in an application.
+
+Getting and building the sources
+
+ To build Tika from sources you first need to either
+ {{{../download.html}download}} a source release or
+ {{{../source-repository.html}checkout}} the latest sources from
+ version control.
+
+ Once you have the sources, you can build them using the
+ {{{http://maven.apache.org/}Maven 2}} build system. Executing the
+ following command in the base directory will build the sources
+ and install the resulting artifacts in your local Maven repository.
+
+---
+mvn install
+---
+
+ See the Maven documentation for more information about the available
+ build options.
+
+ Note that you need Java 5 or higher to build Tika.
+
+Build artifacts
+
+ The Tika 1.1 build consists of a number of components and produces
+ the following main binaries:
+
+ [tika-core/target/tika-core-1.1.jar]
+  Tika core library. Contains the core interfaces and classes of Tika,
+  but none of the parser implementations. Depends only on Java 5.
+
+ [tika-parsers/target/tika-parsers-1.1.jar]
+  Tika parsers. Collection of classes that implement the Tika Parser
+  interface based on various external parser libraries.
+
+ [tika-app/target/tika-app-1.1.jar]
+  Tika application. Combines the above libraries and all the external
+  parser libraries into a single runnable jar with a GUI and a command
+  line interface.
+
+ [tika-bundle/target/tika-bundle-1.1.jar]
+  Tika bundle. An OSGi bundle that includes everything you need to use all
+  Tika functionality in an OSGi environment.
+
+Using Tika as a Maven dependency
+
+ The core library, tika-core, contains the key interfaces and classes of Tika
+ and can be used by itself if you don't need the full set of parsers from
+ the tika-parsers component. The tika-core dependency looks like this:
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-core</artifactId>
+    <version>1.1</version>
+  </dependency>
+---
+
+ If you want to use Tika to parse documents (instead of simply detecting
+ document types, etc.), you'll want to depend on tika-parsers instead: 
+
+---
+  <dependency>
+    <groupId>org.apache.tika</groupId>
+    <artifactId>tika-parsers</artifactId>
+    <version>1.1</version>
+  </dependency>
+---
+
+ Note that adding this dependency will introduce a number of
+ transitive dependencies to your project, including one on tika-core.
+ You need to make sure that these dependencies won't conflict with your
+ existing project dependencies. The listing below shows all the
+ compile-scope dependencies of tika-parsers in the Tika 1.1 release.
+
+---
++- org.apache.tika:tika-core:jar:1.1:compile
++- org.gagravarr:vorbis-java-tika:jar:0.1:compile
+|  \- org.gagravarr:vorbis-java-core:jar:tests:0.1:runtime
++- org.apache.felix:org.apache.felix.scr.annotations:jar:1.6.0:provided
++- edu.ucar:netcdf:jar:4.2-min:compile
+|  \- org.slf4j:slf4j-api:jar:1.5.6:compile
++- org.apache.james:apache-mime4j-core:jar:0.7:compile
++- org.apache.james:apache-mime4j-dom:jar:0.7:compile
++- org.apache.commons:commons-compress:jar:1.3:compile
++- commons-codec:commons-codec:jar:1.5:compile
++- org.apache.pdfbox:pdfbox:jar:1.6.0:compile
+|  +- org.apache.pdfbox:fontbox:jar:1.6.0:compile
+|  +- org.apache.pdfbox:jempbox:jar:1.6.0:compile
+|  \- commons-logging:commons-logging:jar:1.1.1:compile
++- org.bouncycastle:bcmail-jdk15:jar:1.45:compile
++- org.bouncycastle:bcprov-jdk15:jar:1.45:compile
++- org.apache.poi:poi:jar:3.8-beta5:compile
++- org.apache.poi:poi-scratchpad:jar:3.8-beta5:compile
++- org.apache.poi:poi-ooxml:jar:3.8-beta5:compile
+|  +- org.apache.poi:poi-ooxml-schemas:jar:3.8-beta5:compile
+|  |  \- org.apache.xmlbeans:xmlbeans:jar:2.3.0:compile
+|  \- dom4j:dom4j:jar:1.6.1:compile
++- org.apache.geronimo.specs:geronimo-stax-api_1.0_spec:jar:1.0.1:compile
++- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
++- asm:asm:jar:3.1:compile
++- com.googlecode.mp4parser:isoparser:jar:1.0-beta-5:compile
+|  \- net.sf.scannotation:scannotation:jar:1.0.2:compile
+|     \- javassist:javassist:jar:3.6.0.GA:compile
++- com.drewnoakes:metadata-extractor:jar:2.4.0-beta-1:compile
++- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
++- rome:rome:jar:0.9:compile
+|  \- jdom:jdom:jar:1.0:compile
++- org.gagravarr:vorbis-java-core:jar:0.1:compile
++- junit:junit:jar:4.10:test
+|  \- org.hamcrest:hamcrest-core:jar:1.1:test
++- org.mockito:mockito-core:jar:1.7:test
+|  \- org.objenesis:objenesis:jar:1.0:test
+\- org.slf4j:slf4j-log4j12:jar:1.5.6:test
+   \- log4j:log4j:jar:1.2.14:test
+
+---
+
+Using Tika in an Ant project
+
+ Unless you use a dependency manager tool like
+ {{{http://ant.apache.org/ivy/}Apache Ivy}}, to use Tika in your application
+ you can include the Tika jar files and the dependencies individually.
+
+---
+<classpath>
+  ... <!-- your other classpath entries -->
+  <pathelement location="path/to/tika-core-1.1.jar"/>
+  <pathelement location="path/to/tika-parsers-1.1.jar"/>
+  <pathelement location="path/to/commons-logging-1.1.1.jar"/>
+  <pathelement location="path/to/commons-compress-1.0.jar"/>
+  <pathelement location="path/to/pdfbox-1.1.0-incubating.jar"/>
+  <pathelement location="path/to/fontbox-1.1.0-incubator.jar"/>
+  <pathelement location="path/to/jempbox-1.1.0-incubator.jar"/>
+  <pathelement location="path/to/poi-3.6.jar"/>
+  <pathelement location="path/to/poi-scratchpad-3.6.jar"/>
+  <pathelement location="path/to/poi-ooxml-3.6.jar"/>
+  <pathelement location="path/to/poi-ooxml-schemas-3.6.jar"/>
+  <pathelement location="path/to/xmlbeans-2.3.0.jar"/>
+  <pathelement location="path/to/dom4j-1.6.1.jar"/>
+  <pathelement location="path/to/xml-apis-1.0.b2.jar"/>
+  <pathelement location="path/to/geronimo-stax-api_1.0_spec-1.0.jar"/>
+  <pathelement location="path/to/tagsoup-1.2.jar"/>
+  <pathelement location="path/to/asm-3.1.jar"/>
+  <pathelement location="path/to/log4j-1.2.14.jar"/>
+  <pathelement location="path/to/metadata-extractor-2.4.0-beta-1.jar"/>
+</classpath>
+---
+
+ An easy way to gather all these libraries is to run
+ "mvn dependency:copy-dependencies" in the tika-parsers source directory.
+ This will copy all Tika dependencies to the <<<target/dependency>>>
+ directory.
+
+ Alternatively you can simply drop the entire tika-app jar onto your
+ classpath to get all of the above dependencies in a single archive.
+
+Using Tika as a command line utility
+
+ The Tika application jar (tika-app-1.1.jar) can be used as a command
+ line utility for extracting text content and metadata from all sorts of
+ files. This runnable jar contains all the dependencies it needs, so
+ you don't need to worry about classpath settings to run it.
+
+ The usage instructions are shown below.
+
+---
+usage: java -jar tika-app-1.1.jar [option] [file]
+
+Options:
+    -? or --help       Print this usage message
+    -v or --verbose    Print debug level messages
+    -g or --gui        Start the Apache Tika GUI
+    -x or --xml        Output XHTML content (default)
+    -h or --html       Output HTML content
+    -t or --text       Output plain text content
+    -m or --metadata   Output only metadata
+
+Description:
+    Apache Tika will parse the file(s) specified on the
+    command line and output the extracted text content
+    or metadata to standard output.
+
+    Instead of a file name you can also specify the URL
+    of a document to be parsed.
+
+    If no file name or URL is specified (or the special
+    name "-" is used), then the standard input stream
+    is parsed.
+
+    Use the "--gui" (or "-g") option to start
+    the Apache Tika GUI. You can drag and drop files
+    from a normal file explorer to the GUI window to
+    extract text content and metadata from the files.
+---
+
+ You can also use the jar as a component in a Unix pipeline or
+ as an external tool in many scripting languages.
+
+---
+# Check if an Internet resource contains a specific keyword
+curl http://.../document.doc \
+  | java -jar tika-app-1.1.jar --text \
+  | grep -q keyword
+---

Added: tika/site/src/site/apt/1.1/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/index.apt?rev=1304707&view=auto
==============================================================================
--- tika/site/src/site/apt/1.1/index.apt (added)
+++ tika/site/src/site/apt/1.1/index.apt Sat Mar 24 05:30:14 2012
@@ -0,0 +1,176 @@
+                       ---------------
+                       Apache Tika 1.1
+                       ---------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Apache Tika 1.1
+
+
+   The most notable changes in Tika 1.1 over the previous release are:
+
+      * Link Extraction: The rel attribute is now extracted from links 
+        per the LinkContentHandler. 
+        ({{{http://issues.apache.org/jira/browse/TIKA-824}TIKA-824}})
+        
+      * MP3: Fixed handling of UTF-16 (two byte) ID3v2 tags (previously 
+        the last character in a UTF-16 tag could be corrupted) 
+        ({{{http://issues.apache.org/jira/browse/TIKA-793}TIKA-793}})
+        
+      * Performance: Loading of the default media type registry is now 
+        significantly faster. 
+        ({{{http://issues.apache.org/jira/browse/TIKA-780}TIKA-780}})
+        
+      * PDF: Allow controlling whether overlapping duplicated text should 
+        be removed.  Disabling this (the default) can give big speedups to 
+        text extraction and may work around cases where non-duplicated 
+        characters were incorrectly removed 
+        ({{{http://issues.apache.org/jira/browse/TIKA-767}TIKA-767}}).
+        Allow controlling whether text tokens should be sorted by their x/y 
+        position before extracting text 
+        ({{{http://issues.apache.org/jira/browse/TIKA-612}TIKA-612}}); 
+        this is necessary for certain PDFs.  Fixed cases where too many 
+        </p> tags appear in the XHTML output, causing NPE when opening 
+        some PDFs with the GUI 
+        ({{{http://issues.apache.org/jira/browse/TIKA-778}TIKA-778}}).
+        
+      * RTF: Fixed case where a font change would result in processing 
+        bytes in the wrong font's charset, producing bogus text output 
+        ({{{http://issues.apache.org/jira/browse/TIKA-777}TIKA-777}}).  
+        Don't output whitespace in ignored group states, avoiding 
+        excessive whitespace output 
+        ({{{http://issues.apache.org/jira/browse/TIKA-781}TIKA-781}}).  
+        Binary embedded content (using \bin control word) is now skipped 
+        correctly; previously it could cause the parser to incorrectly 
+        extract binary content as text
+        ({{{http://issues.apache.org/jira/browse/TIKA-782}TIKA-782}}).
+      
+      * CLI: New TikaCLI option "--list-detectors", which displays the 
+        mimetype detectors that are available, similar to the existing 
+        "--list-parsers" option for parsers. 
+        ({{{http://issues.apache.org/jira/browse/TIKA-785}TIKA-785}}).
+        
+      * Detectors: The order of detectors, as supplied via the service
+        registry loader, is now controlled. User supplied detectors are 
+        preferred, then Tika detectors (such as the container aware ones), 
+        and finally the core Tika MimeTypes is used as a backup. This 
+        allows for specific, detailed detectors to take preference over 
+        the default mime magic + filename detector. 
+        ({{{http://issues.apache.org/jira/browse/TIKA-786}TIKA-786}})
+        
+      * Microsoft Project (MPP): Filetype detection has been fixed, and 
+        basic metadata (but no text) is now extracted. 
+        ({{{http://issues.apache.org/jira/browse/TIKA-789}TIKA-789}})
+        
+      * Outlook: fixed NullPointerException in TikaGUI when messages with
+        embedded RTF or HTML content were filtered 
+        ({{{http://issues.apache.org/jira/browse/TIKA-801}TIKA-801}}).
+        
+      * Ogg Vorbis and FLAC: Parser added for Ogg Vorbis and FLAC audio
+        files, which extract audio metadata and tags 
+        ({{{http://issues.apache.org/jira/browse/TIKA-747}TIKA-747}}).
+        
+      * MP4: Improved mime magic detection for MP4 based formats (including
+        QuickTime, MP4 Video and Audio, and 3GPP) 
+        ({{{http://issues.apache.org/jira/browse/TIKA-851}TIKA-851}}).
+        
+      * MP4: Basic metadata extracting parser for MP4 files added, which includes
+        limited audio and video metadata, along with the iTunes media metadata
+        (such as Artist and Title) 
+        ({{{http://issues.apache.org/jira/browse/TIKA-852}TIKA-852}}).
+        
+      * Document Passwords: A new ParseContext object, PasswordProvider, 
+        has been added. This provides a way to supply the password for 
+        a document during processing. Currently, only password protected 
+        PDFs and Microsoft OOXML Files are supported. 
+        ({{{http://issues.apache.org/jira/browse/TIKA-850}TIKA-850}}).   
+
+   The following people have contributed to Tika 1.1 by submitting or
+   commenting on the issues resolved in this release:
+
+      * Alex Ott
+      
+      * Alexander Chow 
+      
+      * Ali Oral 
+      
+      * Andrzej Bialecki
+      
+      * Antoni Mylka
+      
+      * Arjohn Kampman
+      
+      * Bastian Mathes
+      
+      * Chris A. Mattmann
+      
+      * Craig Stires
+      
+      * David Tran
+      
+      * Etienne Jouvin
+      
+      * Fabian Lange
+      
+      * Geoff Jarrad
+      
+      * Jan Høydahl
+      
+      * Jerome Lacoste
+      
+      * John Mastarone
+      
+      * Jukka Zitting
+      
+      * Julien Nioche 
+      
+      * Ken Krugler
+      
+      * Lau Brino
+      
+      * Markus Jelsma 
+      
+      * Maxim Valyanskiy
+      
+      * Michael McCandless
+      
+      * Nick Burch
+      
+      * Pablo Queixalos 
+      
+      * Paul Hill
+      
+      * Paul Pearcy 
+      
+      * peter royal
+      
+      * PNS
+      
+      * Radek
+      
+      * Ray Gauss II 
+      
+      * Stephan Mühlstrasser
+      
+      * Swapna Vuppala
+      
+      * Torsten Krah 
+      
+      * William Seemann
+      
+      * Yegor Kozlov 
+
+   See {{http://s.apache.org/Jn4}} for more details on these contributions.

Added: tika/site/src/site/apt/1.1/parser.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/parser.apt?rev=1304707&view=auto
==============================================================================
--- tika/site/src/site/apt/1.1/parser.apt (added)
+++ tika/site/src/site/apt/1.1/parser.apt Sat Mar 24 05:30:14 2012
@@ -0,0 +1,245 @@
+                       --------------------
+                       The Parser interface
+                       --------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+The Parser interface
+
+   The
+   {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+   interface is the key concept of Apache Tika. It hides the complexity of
+   different file formats and parsing libraries while providing a simple and
+   powerful mechanism for client applications to extract structured text
+   content and metadata from all sorts of documents. All this is achieved
+   with a single method:
+
+---
+void parse(
+    InputStream stream, ContentHandler handler, Metadata metadata,
+    ParseContext context) throws IOException, SAXException, TikaException;
+---
+
+   The <<<parse>>> method takes the document to be parsed and related metadata
+   as input and outputs the results as XHTML SAX events and extra metadata.
+   The parse context argument is used to specify context information (like
+   the current locale) that is not related to any individual document.
+   The main criteria that lead to this design were:
+
+   [Streamed parsing] The interface should require neither the client
+     application nor the parser implementation to keep the full document
+     content in memory or spooled to disk. This allows even huge documents
+     to be parsed without excessive resource requirements.
+
+   [Structured content] A parser implementation should be able to
+     include structural information (headings, links, etc.) in the extracted
+     content. A client application can use this information, for example, to
+     better judge the relevance of different parts of the parsed document.
+
+   [Input metadata] A client application should be able to include metadata
+     like the file name or declared content type with the document to be
+     parsed. The parser implementation can use this information to better
+     guide the parsing process.
+
+   [Output metadata] A parser implementation should be able to return
+     document metadata in addition to document content. Many document
+     formats contain metadata like the name of the author that may be useful
+     to client applications.
+
+   [Context sensitivity] While the default settings and behaviour of Tika
+     parsers should work well for most use cases, there are still situations
+     where more fine-grained control over the parsing process is desirable.
+     It should be easy to inject such context-specific information into the
+     parsing process without breaking the layers of abstraction.
+
+   []
+
+   These criteria are reflected in the arguments of the <<<parse>>> method.
+
+* Document input stream
+
+   The first argument is an
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
+   for reading the document to be parsed.
+
+   If this document stream cannot be read, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
+   is passed up to the client application. If the stream can be read but
+   not parsed (for example if the document is corrupted), then the parser
+   throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}.
+
+   The parser implementation will consume this stream but <will not close it>.
+   Closing the stream is the responsibility of the client application that
+   opened it in the first place. The recommended pattern for using streams
+   with the <<<parse>>> method is:
+
+---
+InputStream stream = ...;      // open the stream
+try {
+    parser.parse(stream, ...); // parse the stream
+} finally {
+    stream.close();            // close the stream
+}
+---
+
+   Some document formats like the OLE2 Compound Document Format used by
+   Microsoft Office are best parsed as random access files. In such cases the
+   content of the input stream is automatically spooled to a temporary file
+   that gets removed once parsed. A future version of Tika may make it possible
+   to avoid this extra file if the input document is already a file in the
+   local file system. See
+   {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
+   of this feature request.
+
+* XHTML SAX events
+
+   The parsed content of the document stream is returned to the client
+   application as a sequence of XHTML SAX events. XHTML is used to express
+   structured content of the document and SAX events enable streamed
+   processing. Note that the XHTML format is used here only to convey
+   structural information, not to render the documents for browsing!
+
+   The XHTML SAX events produced by the parser implementation are sent to a
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/ContentHandler.html}ContentHandler}}
+   instance given to the <<<parse>>> method. If the content handler
+   fails to process an event, then parsing stops and the thrown
+   {{{http://java.sun.com/j2se/1.5.0/docs/api/org/xml/sax/SAXException.html}SAXException}}
+   is passed up to the client application.
+
+   The overall structure of the generated event stream is (with indenting
+   added for clarity):
+
+---
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>...</title>
+  </head>
+  <body>
+    ...
+  </body>
+</html>
+---
+
+   Parser implementations typically use the
+   {{{api/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
+   utility class to generate the XHTML output.
+
+   Dealing with the raw SAX events can be a bit complex, so Apache Tika
+   comes with a number of utility classes that can be used to process and
+   convert the event stream to other representations.
+
+   For example, the
+   {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   class can be used to extract just the body part of the XHTML output and
+   feed it either as SAX events to another content handler or as characters
+   to an output stream, a writer, or simply a string. The following code
+   snippet parses a document from the standard input stream and outputs the
+   extracted text content to standard output:
+
+---
+ContentHandler handler = new BodyContentHandler(System.out);
+parser.parse(System.in, handler, ...);
+---
+
+   Another useful class is
+   {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}}, which
+   uses a background thread to parse the document and returns the extracted
+   text content as a character stream:
+
+---
+InputStream stream = ...; // the document to be parsed
+Reader reader = new ParsingReader(parser, stream, ...);
+try {
+    ...;                  // read the document text using the reader
+} finally {
+    reader.close();       // the document stream is closed automatically
+}
+---
+
+* Document metadata
+
+   The third argument to the <<<parse>>> method is used to pass document
+   metadata both in and out of the parser. Document metadata is expressed
+   as an {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+
+   The following are some of the more interesting metadata properties:
+
+   [Metadata.RESOURCE_NAME_KEY] The name of the file or resource that contains
+    the document.
+
+    A client application can set this property to allow the parser to use
+    file name heuristics to determine the format of the document.
+
+    The parser implementation may set this property if the file format
+    contains the canonical name of the file (for example the Gzip format
+    has a slot for the file name).
+
+   [Metadata.CONTENT_TYPE] The declared content type of the document.
+
+    A client application can set this property based on, for example, an HTTP
+    Content-Type header. The declared content type may help the parser to
+    correctly interpret the document.
+
+    The parser implementation sets this property to the content type according
+    to which the document was parsed.
+
+   [Metadata.TITLE] The title of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit title field.
+
+   [Metadata.AUTHOR] The name of the author of the document.
+
+    The parser implementation sets this property if the document format
+    contains an explicit author field.
+
+   []
+
+   Note that metadata handling is still being discussed by the Tika development
+   team, and it is likely that there will be some (backwards incompatible)
+   changes in metadata handling in future Tika releases.
+
+* Parse context
+
+   The final argument to the <<<parse>>> method is used to inject
+   context-specific information into the parsing process. This is useful
+   for example when dealing with locale-specific date and number formats
+   in Microsoft Excel spreadsheets. Another important use of the parse
+   context is passing in the delegate parser instance to be used by
+   two-phase parsers like the
+   {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+   Some parser classes allow customization of the parsing process through
+   strategy objects in the parse context.
+
+* Parser implementations
+
+   Apache Tika comes with a number of parser classes for parsing
+   {{{formats.html}various document formats}}. You can also extend Tika
+   with your own parsers, and of course any contributions to Tika are
+   warmly welcome.
+
+   The goal of Tika is to reuse existing parser libraries like
+   {{{http://www.pdfbox.org/}PDFBox}} or
+   {{{http://poi.apache.org/}Apache POI}} as much as possible, and so most
+   of the parser classes in Tika are adapters to such external libraries.
+
+   Tika also contains some general purpose parser implementations that are
+   not targeted at any specific document formats. The most notable of these
+   is the {{{api/org/apache/tika/parser/AutoDetectParser.html}AutoDetectParser}}
+   class that encapsulates all Tika functionality into a single parser that
+   can handle any type of document. This parser will automatically determine
+   the type of the incoming document based on various heuristics and will then
+   parse the document accordingly.

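The "XHTML SAX events" section of parser.apt above notes that dealing with raw
SAX events can be a bit complex. As a rough illustration of what a
body-extracting handler has to do, here is a minimal sketch that uses only JDK
classes. The names BodyTextCollector and extractBodyText are hypothetical and
are not Tika classes, and the JDK SAX parser merely stands in for a Tika parser
as the source of XHTML events. In real code, Tika's own BodyContentHandler is
the better choice.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: collects character data found inside <body>.
class BodyTextCollector extends DefaultHandler {

    private final StringBuilder text = new StringBuilder();
    private int bodyDepth = 0; // > 0 while inside the <body> element

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Start counting at <body>, then track nesting of its descendants.
        if (bodyDepth > 0 || "body".equals(localName)) {
            bodyDepth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (bodyDepth > 0) {
            bodyDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Ignore text outside <body>, such as the <title> in <head>.
        if (bodyDepth > 0) {
            text.append(ch, start, length);
        }
    }

    // Feeds an XHTML string through the JDK SAX parser (a stand-in for the
    // event stream a Tika parser would generate) and returns the body text.
    static String extractBodyText(String xhtml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true); // so localName is populated
            XMLReader reader = factory.newSAXParser().getXMLReader();
            BodyTextCollector collector = new BodyTextCollector();
            reader.setContentHandler(collector);
            reader.parse(new InputSource(new StringReader(xhtml)));
            return collector.text.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<head><title>Example</title></head>"
                + "<body><p>Hello, Tika</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

The depth counter is one simple way to tell body text from head text. A handler
that also needs headings or links would inspect localName in startElement
instead of discarding it.
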
Added: tika/site/src/site/apt/1.1/parser_guide.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.1/parser_guide.apt?rev=1304707&view=auto
==============================================================================
--- tika/site/src/site/apt/1.1/parser_guide.apt (added)
+++ tika/site/src/site/apt/1.1/parser_guide.apt Sat Mar 24 05:30:14 2012
@@ -0,0 +1,135 @@
+                       --------------------------------------------
+                       Get Tika parsing up and running in 5 minutes
+                       --------------------------------------------
+					   Arturo Beltran
+					   --------------------------------------------
+
+~~ Licensed to the Apache Software Foundation (ASF) under one or more
+~~ contributor license agreements.  See the NOTICE file distributed with
+~~ this work for additional information regarding copyright ownership.
+~~ The ASF licenses this file to You under the Apache License, Version 2.0
+~~ (the "License"); you may not use this file except in compliance with
+~~ the License.  You may obtain a copy of the License at
+~~
+~~     http://www.apache.org/licenses/LICENSE-2.0
+~~
+~~ Unless required by applicable law or agreed to in writing, software
+~~ distributed under the License is distributed on an "AS IS" BASIS,
+~~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+~~ See the License for the specific language governing permissions and
+~~ limitations under the License.
+
+Get Tika parsing up and running in 5 minutes
+
+   This page is a quick start guide showing how to add a new parser to Apache Tika.
+   By following the simple steps listed below, you can have your new parser running in only 5 minutes.
+
+%{toc|section=1|fromDepth=1}
+
+* {Getting Started}
+
+   The {{{gettingstarted.html}Getting Started}} document describes how to 
+   build Apache Tika from sources and how to start using Tika in an application. Pay close attention 
+   and follow the instructions in the "Getting and building the sources" section.
+   
+
+* {Add your MIME-Type}
+
+   You first need to modify {{{http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml}}
+   so that Tika can map the file extension to its MIME type. You should add something like this:
+   
+---
+ <mime-type type="application/hello">
+	<glob pattern="*.hi"/>
+ </mime-type>
+---
+
+* {Create your Parser class}
+
+   Now, you need to create your new parser. This is a class that must implement the Parser interface 
+   offered by Tika. A very simple Tika Parser looks like this:
+   
+---
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ * 
+ * @Author: Arturo Beltran
+ */
+package org.apache.tika.parser.hello;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.util.Collections;
+import java.util.Set;
+
+import org.apache.tika.exception.TikaException;
+import org.apache.tika.metadata.Metadata;
+import org.apache.tika.mime.MediaType;
+import org.apache.tika.parser.ParseContext;
+import org.apache.tika.parser.Parser;
+import org.apache.tika.sax.XHTMLContentHandler;
+import org.xml.sax.ContentHandler;
+import org.xml.sax.SAXException;
+
+public class HelloParser implements Parser {
+
+	private static final Set<MediaType> SUPPORTED_TYPES = Collections.singleton(MediaType.application("hello"));
+	public static final String HELLO_MIME_TYPE = "application/hello";
+	
+	public Set<MediaType> getSupportedTypes(ParseContext context) {
+		return SUPPORTED_TYPES;
+	}
+
+	public void parse(
+			InputStream stream, ContentHandler handler,
+			Metadata metadata, ParseContext context)
+			throws IOException, SAXException, TikaException {
+
+		metadata.set(Metadata.CONTENT_TYPE, HELLO_MIME_TYPE);
+		metadata.set("Hello", "World");
+
+		XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata);
+		xhtml.startDocument();
+		xhtml.endDocument();
+	}
+
+	/**
+	 * @deprecated Pre-1.0 method; use the four-argument parse method instead.
+	 */
+	public void parse(
+			InputStream stream, ContentHandler handler, Metadata metadata)
+			throws IOException, SAXException, TikaException {
+		parse(stream, handler, metadata, new ParseContext());
+	}
+}
+---
+   
+   Pay special attention to the definition of the SUPPORTED_TYPES static class 
+   field in the parser class that defines what MIME-Types it supports. 
+   
+   The "parse" method is where you do all your work: extract the 
+   information from the resource and then set the metadata.
+
+* {List the new parser}
+
+   Finally, you should explicitly tell the AutoDetectParser to include your new 
+   parser. This step is only needed if you want to use the AutoDetectParser functionality. 
+   If you figure out the correct parser in a different way, it isn't needed. 
+   
+   List your new parser in:
+    {{{http://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser}}
+   
+

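For the HelloParser class from the guide above, the entry added to that
services file would presumably be a single line naming the class. This assumes
the usual META-INF/services convention of one fully qualified class name per
line:

```text
org.apache.tika.parser.hello.HelloParser
```
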
Modified: tika/site/src/site/apt/download.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/download.apt?rev=1304707&r1=1304706&r2=1304707&view=diff
==============================================================================
--- tika/site/src/site/apt/download.apt (original)
+++ tika/site/src/site/apt/download.apt Sat Mar 24 05:30:14 2012
@@ -19,19 +19,19 @@
 
 Download Apache Tika
 
-   Apache Tika 1.0 is now available.
-   See the {{{http://www.apache.org/dist/tika/CHANGES-1.0.txt}CHANGES.txt}}
+   Apache Tika 1.1 is now available.
+   See the {{{http://www.apache.org/dist/tika/CHANGES-1.1.txt}CHANGES.txt}}
    file for more information on the list of updates in this initial release.
 
-   * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip}apache-tika-1.0-src.zip}}
-     (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.0-src.zip.asc}PGP signature}})\
-     SHA1: <<<203d84b56c5b8879ce04b496e9b7421387ea386e>>>\
-     MD5: <<<65e82bb15754bbc9f7122dcaf6813831>>>
+   * {{{http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.1-src.zip}apache-tika-1.1-src.zip}}
+     (source archive, {{{http://www.apache.org/dist/tika/apache-tika-1.1-src.zip.asc}PGP signature}})\
+     SHA1: <<<d3185bb22fa3c7318488838989aff0cc9ee025df>>>\
+     MD5: <<<927134622b1c445b5f814f47495495a1>>>
 
    * {{{http://www.apache.org/dyn/closer.cgi/tika/tika-app-1.0.jar}tika-app-1.0.jar}}
      (runnable jar, {{{http://www.apache.org/dist/tika/tika-app-1.0.jar.asc}PGP signature}})\
-     SHA1: <<<25c6e1a77b5e88f8e23db6c074ec95b9b24fb7f2>>>\
-     MD5: <<<9f94067bab5258e70ffa6a79357c11ef>>>
+     SHA1: <<<6c442b0b4b4dfa2d80c78ecaa70b9a5be8a86991>>>\
+     MD5: <<<c69f77dc7f10ab240ed1939687a45574>>>
 
    []
 

Modified: tika/site/src/site/apt/index.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/index.apt?rev=1304707&r1=1304706&r2=1304707&view=diff
==============================================================================
--- tika/site/src/site/apt/index.apt (original)
+++ tika/site/src/site/apt/index.apt Sat Mar 24 05:30:14 2012
@@ -23,7 +23,7 @@ Apache Tika - a content analysis toolkit
    structured text content from various documents using existing parser
    libraries. You can find the latest release on the
    {{{./download.html}download page}}. See the
-   {{{./0.10/gettingstarted.html}Getting Started}} guide for instructions on
+   {{{./1.1/gettingstarted.html}Getting Started}} guide for instructions on
    how to start using Tika.
 
    Tika is a project of the
@@ -32,6 +32,14 @@ Apache Tika - a content analysis toolkit
 
 Latest News
 
+   [23 March 2012: Apache Tika Release]
+    Apache Tika 1.1 is out the door! We've made a number of improvements to 
+    PDF, RTF and MP3 parsing. We've also provided some new features on the
+    command line including the ability to list detectors. Other bug fixes and
+    improvements are listed in the {{{http://www.apache.org/dist/tika/CHANGES-1.1.txt}CHANGES.txt}}
+    file for this release. Have a look at the download page for more information
+    on the release.
+
    [7 November 2011: Apache Tika Release]
     Apache Tika 1.0 has been released, just in time for ApacheCon NA 2011!
     The 1.0 release of Tika removes all deprecated pre 1.0 API methods, makes 

Modified: tika/site/src/site/site.xml
URL: http://svn.apache.org/viewvc/tika/site/src/site/site.xml?rev=1304707&r1=1304706&r2=1304707&view=diff
==============================================================================
--- tika/site/src/site/site.xml (original)
+++ tika/site/src/site/site.xml Sat Mar 24 05:30:14 2012
@@ -39,7 +39,15 @@
       <item name="Issue Tracker" href="https://issues.apache.org/jira/browse/TIKA"/>
     </menu>
     <menu name="Documentation">
-      <item name="Apache Tika 1.0" href="1.0/index.html">
+      <item name="Apache Tika 1.1" href="1.1/index.html">
+        <item name="Getting Started" href="1.1/gettingstarted.html"/>
+        <item name="Supported Formats" href="1.1/formats.html"/>
+        <item name="Parser API" href="1.1/parser.html"/>
+        <item name="Parser 5min Quick Start Guide" href="1.1/parser_guide.html"/>
+        <item name="Content and Language Detection" href="1.1/detection.html"/>
+        <item name="API Documentation" href="1.1/api/"/>
+      </item>
+      <item name="Apache Tika 1.0" href="1.0/index.html" collapse="true">
         <item name="Getting Started" href="1.0/gettingstarted.html"/>
         <item name="Supported Formats" href="1.0/formats.html"/>
         <item name="Parser API" href="1.0/parser.html"/>