Posted to commits@tika.apache.org by ju...@apache.org on 2008/09/16 22:33:32 UTC
svn commit: r696042 - /incubator/tika/trunk/src/site/apt/formats.apt

Author: jukka
Date: Tue Sep 16 13:33:31 2008
New Revision: 696042

URL: http://svn.apache.org/viewvc?rev=696042&view=rev
Log:
TIKA-157: List all the document formats supported by Tika

More format documentation

Modified:
    incubator/tika/trunk/src/site/apt/formats.apt

Modified: incubator/tika/trunk/src/site/apt/formats.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=696042&r1=696041&r2=696042&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (original)
+++ incubator/tika/trunk/src/site/apt/formats.apt Tue Sep 16 13:33:31 2008
@@ -22,13 +22,39 @@
    This page lists all the document formats supported by Apache Tika.
 
    [bzip2 compression (application/x-bzip)]
-    TODO
+    Tika uses an adapted version of the bzip2 parsing code from
+    {{{http://ant.apache.org/}Apache Ant}} to decompress bzip2 streams.
+    The bzip2 code is originally based on work by Keiron Liddle from
+    Aftex Software. Support for bzip2 compression was added in Tika 0.2.
+
+    The bzip2 parser decompresses the incoming stream and passes the
+    resulting stream to a configured delegate parser. If the
+    <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
+    that matches the common patterns <<<*.{tbz2,tbz}>>> or <<<*.{bz2,bz}>>>,
+    then the name is replaced with <<<*.tar>>> or <<<*>>> respectively before
+    passing the decompressed stream for further parsing.
+
+    Bzip2 compression is automatically detected based on a magic header
+    or glob patterns.
 
    [Extensible Markup Language (application/xml)]
     TODO
 
    [gzip compression (application/x-gzip)]
-    TODO
+    Tika uses Java's built-in gzip support to decompress gzip streams.
+    Support for gzip compression was added in Tika 0.2.
+
+    The gzip parser simply uses the
+    {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
+    class to decompress the incoming stream. The resulting stream is
+    passed to a configured delegate parser. If the
+    <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
+    that matches the common patterns <<<*.tgz2>>> or <<<*{.gz,-gz}>>>,
+    then the name is replaced with <<<*.tar>>> or <<<*>>> respectively before
+    passing the decompressed stream for further parsing.
+
+    Gzip compression is automatically detected based on a magic header
+    or glob patterns.
 
    [HyperText Markup Language (text/html)]
     TODO
@@ -61,6 +87,9 @@
     upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
     for the current status of this issue.
 
+    Microsoft Word documents are automatically detected based on a magic
+    header or a glob pattern.
+
     For an example of parsing Microsoft Word files, see the
     {{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
     test case.
@@ -88,6 +117,9 @@
     upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
     for the current status of this issue.
 
+    Microsoft Excel spreadsheets are automatically detected based on a magic
+    header or a glob pattern.
+
     For an example of parsing Microsoft Excel files, see the
     {{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
     test case.
@@ -112,6 +144,9 @@
     upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
     for the current status of this issue.
 
+    Microsoft PowerPoint presentations are automatically detected based on
+    a magic header or a glob pattern.
+
     For an example of parsing Microsoft PowerPoint files, see the
     {{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
     test case.
@@ -131,6 +166,9 @@
     Generic Microsoft Office document properties like title, author, and
     keywords are returned as metadata properties.
 
+    Microsoft Visio diagrams are automatically detected based on a magic
+    header or a glob pattern.
+
    [Microsoft Outlook (application/vnd.ms-outlook)]
     Tika uses the {{{http://poi.apache.org/hsmf/}HSMF}} API in
     {{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
@@ -140,6 +178,9 @@
     the From, To, Cc, and Bcc addresses (formatted for display) along
     with the body text of text/plain messages.
 
+    Microsoft Outlook messages are automatically detected based on a magic
+    header or a glob pattern.
+
     For an example of parsing Microsoft Outlook files, see the
     {{{xref-test/org/apache/tika/parser/microsoft/OutlookParserTest.html}OutlookParserTest}}
     test case.
@@ -174,7 +215,9 @@
     simplify encoding detection.
 
    [Portable Document Format (application/pdf)]
-    TODO
+    Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
+    Portable Document Format (PDF) documents. Support for PDF was added
+    in Tika 0.1.
 
    [Rich Text Format (application/rtf)]
     Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
@@ -186,7 +229,10 @@
     Document metadata extraction is currently not supported.
 
    [tar archive (application/x-tar)]
-    TODO
+    Tika uses an adapted version of the tar parsing code from
+    {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
+    The tar code is originally based on work by Timothy Gerard Endres.
+    Support for tar archives was added in Tika 0.2.
 
    [ZIP archive (application/zip)]
     TODO
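
The decompress-then-delegate behaviour described in the bzip2 and gzip entries above can be sketched roughly as follows. This is an illustrative sketch only, not Tika's parser code: the class name CompressedStreamSketch and the helpers rewriteName, gunzip, and gzip are hypothetical, and the name rewriting covers only the bzip2 patterns and the .gz/-gz suffixes quoted in the text. Only java.util.zip.GZIPInputStream comes from the documentation itself.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch; names and structure are assumptions, not Tika's API.
class CompressedStreamSketch {

    // Rewrite a resource name the way the text describes:
    // tar shorthands become *.tar, plain compression suffixes are dropped.
    static String rewriteName(String name) {
        if (name.endsWith(".tbz2")) {
            return name.substring(0, name.length() - 5) + ".tar";
        } else if (name.endsWith(".tbz")) {
            return name.substring(0, name.length() - 4) + ".tar";
        } else if (name.endsWith(".bz2")) {
            return name.substring(0, name.length() - 4);
        } else if (name.endsWith(".bz") || name.endsWith(".gz") || name.endsWith("-gz")) {
            return name.substring(0, name.length() - 3);
        }
        return name; // no known compression suffix, keep the name unchanged
    }

    // Decompress a gzip stream with Java's built-in GZIPInputStream.
    // A real parser would hand the stream to a delegate parser instead
    // of buffering the whole result in memory.
    static byte[] gunzip(byte[] compressed) {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper used only to produce sample input for the demo below.
    static byte[] gzip(byte[] plain) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = gzip("hello".getBytes(StandardCharsets.UTF_8));
        String text = new String(gunzip(sample), StandardCharsets.UTF_8);
        System.out.println(rewriteName("notes.txt.gz") + ": " + text);
    }
}
```

Rewriting the name before delegating means the downstream parser sees the name the content would have had without compression, for example a tar name for a .tbz2 file, so that glob-based detection can still apply to the decompressed stream.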