Posted to commits@tika.apache.org by ju...@apache.org on 2010/01/31 01:37:56 UTC

svn commit: r904936 - in /lucene/tika/site/src/site: apt/0.6/ apt/0.6/documentation.apt apt/0.6/formats.apt apt/0.6/index.apt apt/0.6/parser.apt site.xml

Author: jukka
Date: Sun Jan 31 00:37:55 2010
New Revision: 904936

URL: http://svn.apache.org/viewvc?rev=904936&view=rev
Log:
site: Add Tika 0.6 documentation

Added:
    lucene/tika/site/src/site/apt/0.6/
      - copied from r904893, lucene/tika/site/src/site/apt/0.5/
    lucene/tika/site/src/site/apt/0.6/parser.apt
      - copied, changed from r904893, lucene/tika/site/src/site/apt/0.5/documentation.apt
Removed:
    lucene/tika/site/src/site/apt/0.6/documentation.apt
Modified:
    lucene/tika/site/src/site/apt/0.6/formats.apt
    lucene/tika/site/src/site/apt/0.6/index.apt
    lucene/tika/site/src/site/site.xml

Modified: lucene/tika/site/src/site/apt/0.6/formats.apt
URL: http://svn.apache.org/viewvc/lucene/tika/site/src/site/apt/0.6/formats.apt?rev=904936&r1=904893&r2=904936&view=diff
==============================================================================
--- lucene/tika/site/src/site/apt/0.6/formats.apt (original)
+++ lucene/tika/site/src/site/apt/0.6/formats.apt Sun Jan 31 00:37:55 2010
@@ -19,294 +19,127 @@
 
 Supported Document Formats
 
-   This page lists all the document formats supported by Apache Tika.
+   This page lists all the document formats supported by Apache Tika 0.6.
+   Follow the links to the various parser class javadocs for more detailed
+   information about each document format and how it is parsed by Tika.
+
+%{toc|section=1|fromDepth=1}
+
+* {HyperText Markup Language}
+
+   The HyperText Markup Language (HTML) is the lingua franca of the web.
+   Tika uses the {{{http://home.ccil.org/~cowan/XML/tagsoup/}TagSoup}}
+   library to support virtually any kind of HTML found on the web.
+   The output from the
+   {{{api/org/apache/tika/parser/html/HtmlParser.html}HtmlParser}} class
+   is guaranteed to be well-formed and valid XHTML, and various heuristics
+   are used to prevent things like inline scripts from cluttering the
+   extracted text content.
+
+* {XML and derived formats}
+
+   The Extensible Markup Language (XML) format is a generic format that can
+   be used for all kinds of content. Tika has custom parsers for some widely
+   used XML vocabularies like XHTML, OOXML and ODF, but the default
+   {{{api/org/apache/tika/parser/xml/DcXMLParser.html}DcXMLParser}}
+   class simply extracts the text content of the document and ignores any XML
+   structure. The only exceptions to this rule are Dublin Core metadata
+   elements that are used for the document metadata.
+
+* {Microsoft Office document formats}
+
+   Microsoft Office and some related applications produce documents in the
+   generic OLE 2 Compound Document and Office Open XML (OOXML) formats. The
+   older OLE 2 format was introduced in Microsoft Office version 97 and was
+   the default format until Office version 2007 and the new XML-based
+   OOXML format. The
+   {{{api/org/apache/tika/parser/microsoft/OfficeParser.html}OfficeParser}}
+   and
+   {{{api/org/apache/tika/parser/microsoft/ooxml/OOXMLParser.html}OOXMLParser}}
+   classes use {{{http://poi.apache.org/}Apache POI}} libraries to support
+   text and metadata extraction from both OLE2 and OOXML documents.
+
+* {OpenDocument Format}
+
+   The OpenDocument format (ODF) is used most notably as the default format
+   of the OpenOffice.org office suite. The
+   {{{api/org/apache/tika/parser/odf/OpenDocumentParser.html}OpenDocumentParser}}
+   class supports this format and the earlier OpenOffice 1.0 format on which
+   ODF is based.
+
+* {Portable Document Format}
+
+   The {{{api/org/apache/tika/parser/pdf/PDFParser.html}PDFParser}} class
+   parses Portable Document Format (PDF) documents using the
+   {{{http://pdfbox.apache.org/}Apache PDFBox}} library.
+
+* {Electronic Publication Format}
+
+   The {{{api/org/apache/tika/parser/epub/EpubParser.html}EpubParser}} class
+   supports the Electronic Publication Format (EPUB) used for many digital
+   books.
+
+* {Rich Text Format}
+
+   The {{{api/org/apache/tika/parser/rtf/RTFParser.html}RTFParser}} class
+   uses the standard javax.swing.text.rtf feature to extract text content
+   from Rich Text Format (RTF) documents.
+
+* {Compression and packaging formats}
+
+   Tika uses the {{{http://commons.apache.org/compress/}Commons Compress}}
+   library to support various compression and packaging formats. The
+   {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}}
+   class and its subclasses first parse the top level compression or
+   packaging format and then pass the unpacked document streams to a
+   second parsing stage using the parser instance specified in the
+   parse context.
+
+* {Text formats}
+
+   Extracting text content from plain text files seems like a simple task
+   until you start thinking of all the possible character encodings. The
+   {{{api/org/apache/tika/parser/txt/TXTParser.html}TXTParser}} class uses
+   encoding detection code from the {{{http://site.icu-project.org/}ICU}}
+   project to automatically detect the character encoding of a text document.
 
-* Microsoft's OLE 2 Compound Document format
-
-   A number of Microsoft applications, most notably the Microsoft Office
-   suite, use the generic OLE 2 Compound Document format as the basis of
-   their document formats. Tika uses {{{http://poi.apache.org/}Apache POI}}
-   to support a number of these formats.
-
-   The OLE2 Compound Document format is designed for use with random access
-   files, and so the input stream passed to a Tika parser needs to be spooled
-   in memory or in a temporary file depending on the size of the document.
-   See {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for an
-   effort to avoid this extra temporary file if the input document already
-   comes from a file.
-
-   In addition to the shared base format there's also a shared sets of
-   metadata in typical OLE2 documents. Tika uses the
-   {{{http://poi.apache.org/hpsf/}HPSF library}} from POI to parse these
-   property sets and exposes them as the following document metadata:
-
-      * <<<TITLE>>> Title
-
-      * <<<SUBJECT>>> Subject
-
-      * <<<AUTHOR>>> Author
-
-      * <<<KEYWORDS>>> Keywords
-
-      * <<<COMMENTS>>> Comments
-
-      * <<<TEMPLATE>>> Template
-
-      * <<<LAST_SAVED>>> Last Saved By
-
-      * <<<REVISION_NUMBER>>> Revision Number
-
-      * <<<LAST_PRINTED>>> Last Printed
-
-      * <<<LAST_SAVED>>> Last Saved Time/Date
-
-      * <<<LAST_SAVED>>> Last Saved Time/Date
-
-      * <<<PAGE_COUNT>>> Number of Pages
-
-      * <<<WORD_COUNT>>> Number of Words
-
-      * <<<CHARACTER_COUNT>>> Number of Characters
-
-      * <<<APPLICATION_NAME>>> Name of Creating Application
-
-   Note that in practice the metadata in many documents is either missing,
-   incomplete or even incorrect, so a client application should not rely
-   too much on this information.
-
-   Support for the new Office Open XML format used by Microsoft Office
-   version 2007 is pending for a POI upgrade. Current status is recorded in
-   {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}.
-
-   The generic OLE2 Compound Document format is automatically detected using
-   a magic number, and further parsing can automatically determine the more
-   specific document format. Tika also knows a number of common glob patterns
-   like <<<*.doc>>> and <<<*.ppt>>> for these formats.
-
-   The supported OLE 2 Compound Document formats are:
-
-   [Microsoft Excel (application/vnd.ms-excel)]
-    Excel spreadsheet support is available in all versions of Tika and is
-    based on the {{{http://poi.apache.org/hssf/}HSSF library}} from POI.
-
-    The Excel parser in Tika uses the
-    {{{http://poi.apache.org/hssf/how-to.html#event_api}HSSF event API}} and
-    is able to extract much of the document structure, including all
-    (non-empty) worksheets and their table structures. Formula results are
-    extracted as stored in the Excel file, and cell links are exposed as
-    XHTML links. These features were added in Tika version 0.2.
-
-    Cell comments and formatting are currently not supported. See
-    {{{https://issues.apache.org/jira/browse/TIKA-148}TIKA-148}} and
-    {{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}} for the
-    respective issues.
-
-    See the {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
-    test case for an example of parsing Microsoft Excel files.
-
-   [Microsoft Word (application/msword)]
-    Word document support is available in all versions of Tika and is based
-    on the {{{http://poi.apache.org/hwpf/}HWPF library}} from POI.
-
-    The Word parser uses the
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hwpf/extractor/WordExtractor.html}WordExtractor}}
-    class from HWPF to extract document content as a sequence of paragraphs.
-
-    See the {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
-    test case for an example of parsing Microsoft Word files.
-
-   [Microsoft PowerPoint (application/vnd.ms-powerpoint)]
-    PowerPoint presentation support is available in all versions of Tika and
-    is based on the {{{http://poi.apache.org/hslf/}HSLF library}} from POI.
-
-    The PowerPoint parser uses the
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hslf/extractor/PowerPointExtractor.html}PowerPointExtractor}}
-    class from HSLF to extract spreadsheet content as a single paragraph.
-
-    See the {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
-    test case for an example of parsing Microsoft PowerPoint files.
-
-   [Microsoft Visio (application/vnd.visio)]
-    Visio diagram support was added in Tika version 0.2 and is based on the
-    {{{http://poi.apache.org/hdgf/}HDGF library}} from POI.
-
-    The Visio parser uses the
-    {{{http://poi.apache.org/apidocs/org/apache/poi/hdgf/extractor/VisioTextExtractor.html}VisioExtractor}}
-    class from HDGF to extract diagram content as a sequence of paragraphs.
-
-   [Microsoft Outlook (application/vnd.ms-outlook)]
-    Outlook message support was added in Tika version 0.2 and is based on the
-    {{{http://poi.apache.org/hsmf/}HSMF library}} from POI.
-
-    The Outlook parser extracts the subject of the message and the From,
-    To, Cc, and Bcc addresses (formatted for display) along with the body
-    text of text/plain messages. The <<<AUTHOR>>>, <<<TITLE>>> and
-    <<<SUBJECT>>> metadata properties are set explicitly, overriding
-    potential generic document metadata retrieved from OLE2 property sets.
-
-* Compression formats
-
-   General purpose compression formats are used to reduce the size of
-   any kinds of documents. Tika uses a parsing pipeline to support general
-   purpose compression: in the first stage the compressed stream decompressed
-   and the resulting decompressed stream is passed on to a second parsing
-   stage where it will be processed as if the document had never been
-   compressed.
-
-   Tika contains magic numbers and glob patterns for auto-detecting all
-   supported compression formats. The glob patterns of compression formats
-   are also used to determine the name of the original uncompressed document.
-   If a client application has supplied a <<<RESOURCE_NAME_KEY>>> metadata
-   property that matches such a glob pattern, then the decompressing first
-   parsing stage will replace the <<<RESOURCE_NAME_KEY>>> metadata property
-   with the deduced original document name before passing control to the
-   second parsing stage.
-
-   Note that apart from the special handling of the <<<RESOURCE_NAME_KEY>>>
-   property, no document metadata is passed to or from the second parsing
-   stage. Only the text content extracted by the second stage parser is
-   returned to the client application.
-
-   The supported compression formats are:
-
-   [gzip compression (application/x-gzip)]
-    {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
-    Tika version 0.2 and is based on the
-    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
-    class in the Java 5 class library.
-
-    The known gzip glob patterns are <<<*.tgz>>>, <<<*.gz>>> and <<<*-gz>>>,
-    and they will respectively be replaced with <<<*.tar>>>, <<<*>>> and
-    <<<*>>> as described above.
-
-   [bzip2 compression (application/x-bzip)]
-    {{{http://en.wikipedia.org/wiki/Bzip2}Bzip2}} support was added in
-    Tika version 0.2 and is based on bzip2 parsing code from
-    {{{http://ant.apache.org/}Apache Ant}}, which in turn was originally
-    based on work by Keiron Liddle from Aftex Software.
-
-    The known bzip2 glob patterns are <<<*.tbz>>>, <<<*.tbz2>>>, <<<*.bz>>>
-    and <<<*.bz2>>>, and they will respectively be replaced with <<<*.tar>>>,
-    <<<*.tar>>>, <<<*>>> and <<<*>>> as described above.
-
-* Audio formats
+* {Audio formats}
 
    Tika can detect several common audio formats and extract metadata
-   from them. Text extraction is supported for some MIDI-based karaoke
-   formats that contain the lyrics of the encoded audio.
-
-   See {{{https://issues.apache.org/jira/browse/TIKA-94}TIKA-94}} for
-   an effort to integrate speech recognition support to Tika.
-
-   [MP3 Audio (audio/mpeg)]
-    The parsing of {{{http://www.id3.org/ID3v1}ID3v1}} tags from MP3 files
-    was added in Tika version 0.2. If found the following metadata is
-    extracted and set:
-
-      * <<<TITLE>>> Title
-
-      * <<<SUBJECT>>> Subject
-
-    The above information, as well as the <<<Album>>>, <<<Track>>>,
-    <<<Year>>>, <<<Genre>>> and additional <<<Comment>>> are extracted
-    when set in the file.
-
-   [MIDI audio (audio/midi)]
-    Tika uses the MIDI support in <<<javax.audio.midi>>> to parse MIDI
-    sequence files. Many karaoke file formats are based on MIDI, and
-    contain lyrics as embedded text tracks that Tika knows how to extract.
-
-    Support for MIDI files was added in Tika 0.3.
-
-   [Wave audio (audio/basic)]
-    Tika supports sampled wave audio (.wav files, etc.) using the
-    <<<javax.audio.sampled>>> package. Only sampling metadata is extracted.
-
-    Support for sampled wave audio was added in Tika 0.3. 
-
-* Other supported formats
-
-   [Extensible Markup Language (application/xml)]
-    Tika uses the <<<javax.xml>>> classes to parse Extensible Markup Language files.
-    Support for Extensible Markup Language files was added in Tika 0.1.
-
-   [HyperText Markup Language (text/html)]
-    Tika uses the {{{http://sourceforge.net/projects/nekohtml}CyberNeko}} library to parse HyperText Markup Language files.
-    Support for HyperText Markup Language files was added in Tika 0.1.
-
-   [Images (image/*)]
-    Tika uses the <<<javax.imageio>>> classes to extract metadata
-    from image files.
-
-    Support for Image files was added in Tika 0.2.
-
-   [Java class files]
-    The parsing of Java Class files is based on the asm library and
-    work by Dave Brosius in JCR-1522.
-
-    Support for Java Class files was added in Tika 0.2.
-
-   [Java jar archives]
-    The parsing of Java JAR archives is performed using a combination of
-    the ZIP and Java class file parsers.
-
-    Support for Java JAR archives was added in Tika 0.2.
-
-   [OpenDocument (application/vnd.oasis.opendocument.*)]
-    Tika uses the built-in ZIP and XML features in Java to parse the
-    {{{http://en.wikipedia.org/wiki/OpenDocument}OpenDocument}} document types
-    used most notably by OpenOffice 2.0 and higher. The older OpenOffice 1.0
-    formats are also supported, though they are currently not auto-detected
-    as well as the newer formats.
-
-    Support for the OpenDocument formats was added in Tika 0.3.
-
-   [Plain text (text/plain)]
-    Tika uses the
-    {{{http://www.icu-project.org/}International Components for Unicode}}
-    Java library (ICU4J) to parse plain text. Support for plain text was added
-    in Tika 0.1.
-
-    Extracting text content from plain text files is actually a relatively
-    complex task due to the fact that the character encoding of the text
-    file is often unknown to the parser.
-
-    The text parser in Tika uses the ICU4J
-    {{{http://www.icu-project.org/apiref/icu4j/com/ibm/icu/text/CharsetDetector.html}CharsetDetector}}
-    class to automatically detect the character encoding of any text input.
-    As an added benefit, the ICU4J library is in some cases able to detect
-    also the language in which the text is written.
-
-    The character encoding and language of the plain text document are
-    returned as the <<<Metadata.CONTENT_ENCODING>>> and <<<Metadata.LANGUAGE>>>
-    metadata properties. If the (declared) content encoding of a text document
-    is already known to the client application, then it can be supplied as the
-    <<<Metadata.CONTENT_ENCODING>>> metadata property to the parser to
-    simplify encoding detection.
-
-   [Portable Document Format (application/pdf)]
-    Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
-    Portable Document Format (PDF) documents.
-
-    Support for PDF was added in Tika 0.1.
-
-   [Rich Text Format (application/rtf)]
-    Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
-    documents. Support for RTF was added in Tika 0.1.
-
-    The RTF parser in Tika uses the Swing
-    {{{http://java.sun.com/j2se/1.5.0/docs/api/javax/swing/text/rtf/RTFEditorKit.html}RTFEditorKit}}
-    class to extract all text from an RTF document as a single paragraph.
-    Document metadata extraction is currently not supported.
-
-   [tar archive (application/x-tar)]
-    Tika uses an adapted version of the tar parsing code from
-    {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
-    The tar code is originally based on work by Timothy Gerard Endres.
-
-    Support for tar archives was added in Tika 0.2.
-
-   [ZIP archive (application/zip)]
-    Tika uses Java's built-in Zip classes to parse ZIP files.
-
-    Support for ZIP was added in Tika 0.2.
+   from them. Even text extraction is supported for some audio files that
+   contain lyrics or other textual content. The
+   {{{api/org/apache/tika/parser/audio/AudioParser.html}AudioParser}}
+   and {{{api/org/apache/tika/parser/audio/MidiParser.html}MidiParser}}
+   classes use standard javax.sound features to process simple audio
+   formats, and the
+   {{{api/org/apache/tika/parser/mp3/Mp3Parser.html}Mp3Parser}} class
+   adds support for the widely used MP3 format.
+
+* {Image formats}
+
+   The {{{api/org/apache/tika/parser/image/ImageParser.html}ImageParser}}
+   class uses the standard javax.imageio feature to extract simple metadata
+   from image formats supported by the Java platform. More complex image
+   metadata is available through the
+   {{{api/org/apache/tika/parser/jpeg/JpegParser.html}JpegParser}} class
+   that uses the metadata-extractor library to support Exif metadata
+   extraction from JPEG images.
+
+* {Video formats}
+
+   Currently Tika only supports the Flash video format using a simple
+   parsing algorithm implemented in the
+   {{{api/org/apache/tika/parser/flv/FLVParser.html}FLVParser}} class.
+
+* {Java class files and archives}
+
+   The {{{api/org/apache/tika/parser/asm/ClassParser.html}ClassParser}} class
+   extracts class names and method signatures from Java class files, and
+   the {{{api/org/apache/tika/parser/pkg/ZipParser.html}ZipParser}} class
+   also supports jar archives.
+
+* {The mbox format}
+
+   The {{{api/org/apache/tika/parser/mbox/MboxParser.html}MboxParser}} can
+   extract email messages from the mbox format used by many email archives
+   and Unix-style mailboxes.
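
The two-stage parsing that formats.apt describes for compression and
packaging formats can be sketched roughly as follows. This is only an
illustration under stated assumptions, not Tika's implementation: it uses
java.util.zip from the JDK as a stand-in for Commons Compress, and the
EntryParser interface and TwoStageSketch class are hypothetical names that
play the role of the delegate parser from the parse context.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class TwoStageSketch {

    /** Hypothetical stand-in for the second-stage (delegate) parser. */
    interface EntryParser {
        String parse(InputStream in) throws IOException;
    }

    /** Stage one: walk the package and hand each entry stream to the delegate. */
    static Map<String, String> parsePackage(InputStream packed, EntryParser delegate)
            throws IOException {
        Map<String, String> results = new LinkedHashMap<>();
        ZipInputStream zip = new ZipInputStream(packed);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if (!entry.isDirectory()) {
                // The delegate reads up to the end of the current entry
                // and must not close the shared stream.
                results.put(entry.getName(), delegate.parse(zip));
            }
        }
        return results;
    }

    /** Stage two (toy version): treat every entry as UTF-8 plain text. */
    static String readUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a small in-memory zip archive with two text entries. */
    static byte[] sampleZip() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("a.txt"));
            zip.write("alpha".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
            zip.putNextEntry(new ZipEntry("b.txt"));
            zip.write("beta".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> parsed =
                parsePackage(new ByteArrayInputStream(sampleZip()), TwoStageSketch::readUtf8);
        System.out.println(parsed);
    }
}
```

Keeping the delegate behind a small interface mirrors the idea in the
text: the packaging stage does not need to know which parser handles the
unpacked streams.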

Modified: lucene/tika/site/src/site/apt/0.6/index.apt
URL: http://svn.apache.org/viewvc/lucene/tika/site/src/site/apt/0.6/index.apt?rev=904936&r1=904893&r2=904936&view=diff
==============================================================================
--- lucene/tika/site/src/site/apt/0.6/index.apt (original)
+++ lucene/tika/site/src/site/apt/0.6/index.apt Sun Jan 31 00:37:55 2010
@@ -1,5 +1,5 @@
                        ---------------
-                       Apache Tika 0.5
+                       Apache Tika 0.6
                        ---------------
 
 ~~ Licensed to the Apache Software Foundation (ASF) under one or more
@@ -17,84 +17,96 @@
 ~~ See the License for the specific language governing permissions and
 ~~ limitations under the License.
 
-Apache Tika 0.5
+Apache Tika 0.6
 
-   The most notable changes in Tika 0.5 over the previous release are:
+   The most notable changes in Tika 0.6 over the previous release are:
 
-      * Improved RDF/OWL mime detection using both MIME magic as well as
-        pattern matching.
-        ({{{https://issues.apache.org/jira/browse/TIKA-309}TIKA-309}})
-
-      * An org.apache.tika.Tika facade class has been added to simplify
-        common text extraction and type detection use cases.
-        ({{{https://issues.apache.org/jira/browse/TIKA-269}TIKA-269}})
-
-      * A new parse context argument was added to the Parser.parse() method.
-        This context map can be used to pass things like a delegate parser
-        or other settings to the parsing process. The previous parse() method
-        signature has been deprecated and will be removed in Tika 1.0.
-        ({{{https://issues.apache.org/jira/browse/TIKA-275}TIKA-275}})
-
-      * A simple ngram-based language detection mechanism has been added
-        along with predefined language profiles for 18 languages.
-        ({{{https://issues.apache.org/jira/browse/TIKA-209}TIKA-209}})
-
-      * The media type registry in Tika was synchronized with the MIME type
-        configuration in the Apache HTTP Server. Tika now knows about 1274
-        different media types and can detect 672 of those using 927 file
-        extension and 280 magic byte patterns.
-        ({{{https://issues.apache.org/jira/browse/TIKA-285}TIKA-285}})
-
-      * Tika now uses the Apache PDFBox version 0.8.0-incubating for parsing
-        PDF documents. This version is notably better than the 0.7.3 release
-        used earlier.
-        ({{{https://issues.apache.org/jira/browse/TIKA-158}TIKA-158}})
+      * MIME type detection for HTML (and all other types) has been
+        improved. Malformed HTML files, and HTML files that require a bit
+        more observed content before the type can be detected, are now
+        correctly identified by the AutoDetectParser.
+        ({{{https://issues.apache.org/jira/browse/TIKA-327}TIKA-327}},
+         {{{https://issues.apache.org/jira/browse/TIKA-357}TIKA-357}},
+         {{{https://issues.apache.org/jira/browse/TIKA-366}TIKA-366}},
+         {{{https://issues.apache.org/jira/browse/TIKA-367}TIKA-367}})
+
+      * Tika now has an additional OSGi bundle packaging that includes all
+        the required parser libraries. This bundle package makes it easy to
+        use all Tika features in an OSGi environment.
+        ({{{https://issues.apache.org/jira/browse/TIKA-340}TIKA-340}},
+         {{{https://issues.apache.org/jira/browse/TIKA-342}TIKA-342}})
+
+      * The Apache POI dependency used for parsing Microsoft Office file
+        formats has been upgraded to version 3.6. The most visible
+        improvement in this version is the notably reduced ooxml jar file
+        size. The tika-app jar size is now down to 15MB from the 25MB in
+        Tika 0.5.
+        ({{{https://issues.apache.org/jira/browse/TIKA-353}TIKA-353}})
+
+      * Handling of character encoding information in input metadata and
+        HTML \<meta\> tags has been improved. When no applicable encoding
+        information is available, the encoding is detected by looking at
+        the input data.
+        ({{{https://issues.apache.org/jira/browse/TIKA-332}TIKA-332}},
+         {{{https://issues.apache.org/jira/browse/TIKA-334}TIKA-334}},
+         {{{https://issues.apache.org/jira/browse/TIKA-335}TIKA-335}},
+         {{{https://issues.apache.org/jira/browse/TIKA-341}TIKA-341}}) 
+
+      * Some document types like Excel spreadsheets contain content like
+        numbers or formulas whose exact text format depends on the current
+        locale. So far Tika has used the platform default locale in such
+        cases, but clients can now explicitly specify the locale by passing
+        a Locale instance in the parse context.
+        ({{{https://issues.apache.org/jira/browse/TIKA-125}TIKA-125}})
+
+      * The default text output encoding of the tika-app jar is now UTF-8
+        when running on Mac OS X. This is because the default encoding used
+        by Java is not compatible with the console application in Mac OS X.
+        On all other platforms the text output from tika-app still uses
+        the platform default encoding.
+        ({{{https://issues.apache.org/jira/browse/TIKA-324}TIKA-324}})
+
+      * A flash video (video/x-flv) parser has been added.
+        ({{{https://issues.apache.org/jira/browse/TIKA-328}TIKA-328}})
+ 
+      * Handling of number and date cell formatting in Microsoft Excel
+        documents has been added. This includes currencies, percentages
+        and scientific formats.
+        ({{{https://issues.apache.org/jira/browse/TIKA-103}TIKA-103}})
 
-   The following people have contributed to Tika 0.5 by submitting or
+   The following people have contributed to Tika 0.6 by submitting or
    commenting on the issues resolved in this release:
 
-      * Alex Baranov
+      * Andrzej Bialecki
 
-      * Bart Hanssens
-
-      * Benson Margulies
+      * Bertrand Delacretaz
 
       * Chris A. Mattmann
 
-      * Daan de Wit
+      * Dave Meikle
 
       * Erik Hetzner
 
-      * Frank Hellwig
-
-      * Jeff Cadow
+      * Felix Meschberger
 
-      * Joachim Zittmayr
-
-      * Jukka Zitting 
+      * Jukka Zitting
 
       * Julien Nioche
 
-      * Ken Krugler
+      * Ken Krugler  
+
+      * Luke Nezda
 
       * Maxim Valyanskiy
 
-      * MRIT64
+      * Niall Pemberton
 
-      * Paul Borgermans
+      * Peter Wolanin 
 
       * Piotr B.
 
-      * Robert Newson
-
-      * Sascha Szott
-
-      * Ted Dunning
-
-      * Thilo Goetz
-
-      * Uwe Schindler
+      * Sami Siren
 
       * Yuan-Fang Li
 
-   See {{http://tinyurl.com/yl9prwp}} for more details on these contributions.
+   See {{http://tinyurl.com/yc3dk67}} for more details on these contributions.
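
The locale item in the 0.6 release notes (TIKA-125) is easier to see with
a concrete case. The sketch below uses only java.text.NumberFormat from the
JDK to show how the same numeric cell value renders differently per
locale; it does not call Tika, and the LocaleFormatSketch class name is
made up for illustration.

```java
import java.text.NumberFormat;
import java.util.Locale;

class LocaleFormatSketch {

    /** Formats a number the way a locale-aware extractor might render a cell. */
    static String format(double value, Locale locale) {
        return NumberFormat.getNumberInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(1234.5, Locale.US));      // 1,234.5
        System.out.println(format(1234.5, Locale.GERMANY)); // 1.234,5
    }
}
```

With the platform default locale, the extracted text would depend on where
the code runs, which is why an explicit Locale in the parse context is
useful.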

Copied: lucene/tika/site/src/site/apt/0.6/parser.apt (from r904893, lucene/tika/site/src/site/apt/0.5/documentation.apt)
URL: http://svn.apache.org/viewvc/lucene/tika/site/src/site/apt/0.6/parser.apt?p2=lucene/tika/site/src/site/apt/0.6/parser.apt&p1=lucene/tika/site/src/site/apt/0.5/documentation.apt&r1=904893&r2=904936&rev=904936&view=diff
==============================================================================
--- lucene/tika/site/src/site/apt/0.5/documentation.apt (original)
+++ lucene/tika/site/src/site/apt/0.6/parser.apt Sun Jan 31 00:37:55 2010
@@ -1,6 +1,6 @@
-                       -------------------------
-                       Apache Tika Documentation
-                       -------------------------
+                       --------------------
+                       The Parser interface
+                       --------------------
 
 ~~ Licensed to the Apache Software Foundation (ASF) under one or more
 ~~ contributor license agreements.  See the NOTICE file distributed with
@@ -17,13 +17,10 @@
 ~~ See the License for the specific language governing permissions and
 ~~ limitations under the License.
 
-Apache Tika Documentation
-
-   This document describes the key abstractions and usage of Apache Tika.
-
 The Parser interface
 
-   The {{{apidocs/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
+   The
+   {{{api/org/apache/tika/parser/Parser.html}org.apache.tika.parser.Parser}}
    interface is the key concept of Apache Tika. It hides the complexity of
    different file formats and parsing libraries while providing a simple and
    powerful mechanism for client applications to extract structured text
@@ -31,12 +28,15 @@
    with a single method:
 
 ---
-void parse(InputStream stream, ContentHandler handler, Metadata metadata)
-    throws IOException, SAXException, TikaException;
+void parse(
+    InputStream stream, ContentHandler handler, Metadata metadata,
+    ParseContext context) throws IOException, SAXException, TikaException;
 ---
 
    The <<<parse>>> method takes the document to be parsed and related metadata
    as input and outputs the results as XHTML SAX events and extra metadata.
+   The parse context argument is used to specify context information (like
+   the current locale) that is not related to any individual document.
    The main criteria that lead to this design were:
 
    [Streamed parsing] The interface should require neither the client
@@ -59,11 +59,17 @@
      formats contain metadata like the name of the author that may be useful
      to client applications.
 
+   [Context sensitivity] While the default settings and behaviour of Tika
+     parsers should work well for most use cases, there are still situations
+     where more fine-grained control over the parsing process is desirable.
+     It should be easy to inject such context-specific information to the
+     parsing process without breaking the layers of abstraction.
+
    []
 
    These criteria are reflected in the arguments of the <<<parse>>> method.
 
-Document input stream
+* Document input stream
 
    The first argument is an
    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/InputStream.html}InputStream}}
@@ -73,7 +79,7 @@
    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/io/IOException.html}IOException}}
    is passed up to the client application. If the stream can be read but
    not parsed (for example if the document is corrupted), then the parser
-   throws a {{{apidocs/org/apache/tika/exception/TikaException.html}TikaException}}.
+   throws a {{{api/org/apache/tika/exception/TikaException.html}TikaException}}.
 
    The parser implementation will consume this stream but <will not close it>.
    Closing the stream is the responsibility of the client application that
@@ -98,7 +104,7 @@
    {{{https://issues.apache.org/jira/browse/TIKA-153}TIKA-153}} for the status
    of this feature request.
 
-XHTML SAX events
+* XHTML SAX events
 
    The parsed content of the document stream is returned to the client
    application as a sequence of XHTML SAX events. XHTML is used to express
@@ -131,12 +137,12 @@
    {{{apidocs/org/apache/tika/sax/XHTMLContentHandler.html}XHTMLContentHandler}}
    utility class to generate the XHTML output.
 
-   Dealing with the raw SAX events can be a bit complex, so Apache Tika (since
-   version 0.2) comes with a number of utility classes that can be used to
-   process and convert the event stream to other representations.
+   Dealing with the raw SAX events can be a bit complex, so Apache Tika
+   comes with a number of utility classes that can be used to process and
+   convert the event stream to other representations.
 
    For example, the
-   {{{apidocs/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
+   {{{api/org/apache/tika/sax/BodyContentHandler.html}BodyContentHandler}}
    class can be used to extract just the body part of the XHTML output and
    feed it either as SAX events to another content handler or as characters
    to an output stream, a writer, or simply a string. The following code
@@ -149,7 +155,7 @@
 ---
 
    Another useful class is
-   {{{apidocs/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
+   {{{api/org/apache/tika/parser/ParsingReader.html}ParsingReader}} that
    uses a background thread to parse the document and returns the extracted
    text content as a character stream:
 
@@ -163,11 +169,11 @@
 }
 ---
 
-Document metadata
+* Document metadata
 
-   The final argument to the <<<parse>>> method is used to pass document
+   The third argument to the <<<parse>>> method is used to pass document
    metadata both in and out of the parser. Document metadata is expressed
-   as an {{{apidocs/org/apache/tika/metadata/Metadata.html}Metadata}} object.
+   as an {{{api/org/apache/tika/metadata/Metadata.html}Metadata}} object.
 
    The following are some of the more interesting metadata properties:
 
@@ -206,7 +212,19 @@
    team, and it is likely that there will be some (backwards incompatible)
    changes in metadata handling before Tika 1.0.
 
-Parser implementations
+* Parse context
+
+   The final argument to the <<<parse>>> method is used to inject
+   context-specific information to the parsing process. This is useful
+   for example when dealing with locale-specific date and number formats
+   in Microsoft Excel spreadsheets. Another important use of the parse
+   context is passing in the delegate parser instance to be used by
+   two-phase parsers like the
+   {{{api/org/apache/tika/parser/pkg/PackageParser.html}PackageParser}} subclasses.
+   Some parser classes allow customization of the parsing process through
+   strategy objects in the parse context.
+
+* Parser implementations
 
    Apache Tika comes with a number of parser classes for parsing
    {{{formats.html}various document formats}}. You can also extend Tika
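
To make the "XHTML SAX events" section of parser.apt more concrete, here
is a rough sketch of a content handler that keeps only the character data
inside the body element, which is the idea behind BodyContentHandler. It
is a simplified stand-in written against the JDK SAX API only: the
BodyTextSketch and BodyTextHandler names are hypothetical, and the XHTML
is fed from a string rather than produced by a Tika parser.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class BodyTextSketch {

    /** Collects character events that occur inside the body element. */
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private int bodyDepth = 0; // 0 means we are outside of <body>

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if (bodyDepth > 0 || "body".equals(qName)) {
                bodyDepth++;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (bodyDepth > 0) {
                bodyDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (bodyDepth > 0) {
                text.append(ch, start, length);
            }
        }

        String getText() {
            return text.toString();
        }
    }

    /** Parses an XHTML string and returns the text found inside body. */
    static String extractBodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>T</title></head>"
                + "<body><p>Hello</p><p>World</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

In real use the events would come from a Tika parser's parse method rather
than from an XML string, but the handler side looks much the same.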

Modified: lucene/tika/site/src/site/site.xml
URL: http://svn.apache.org/viewvc/lucene/tika/site/src/site/site.xml?rev=904936&r1=904935&r2=904936&view=diff
==============================================================================
--- lucene/tika/site/src/site/site.xml (original)
+++ lucene/tika/site/src/site/site.xml Sun Jan 31 00:37:55 2010
@@ -40,7 +40,13 @@
       <item name="Download" href="download.html"/>
     </menu>
     <menu name="Documentation">
-      <item name="Tika 0.5" href="0.5/index.html">
+      <item name="Tika 0.6" href="0.6/index.html">
+        <item name="Getting Started" href="0.6/gettingstarted.html"/>
+        <item name="Supported Formats" href="0.6/formats.html"/>
+        <item name="Parser API" href="0.6/parser.html"/>
+        <item name="API Documentation" href="0.6/api/"/>
+      </item>
+      <item name="Tika 0.5" href="0.5/index.html" collapse="true">
         <item name="Getting Started" href="0.5/gettingstarted.html"/>
         <item name="Documentation" href="0.5/documentation.html"/>
         <item name="Supported Formats" href="0.5/formats.html"/>