Posted to commits@tika.apache.org by ju...@apache.org on 2008/09/16 22:33:32 UTC
svn commit: r696042 - /incubator/tika/trunk/src/site/apt/formats.apt
Author: jukka
Date: Tue Sep 16 13:33:31 2008
New Revision: 696042
URL: http://svn.apache.org/viewvc?rev=696042&view=rev
Log:
TIKA-157: List all the document formats supported by Tika
More format documentation
Modified:
incubator/tika/trunk/src/site/apt/formats.apt
Modified: incubator/tika/trunk/src/site/apt/formats.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=696042&r1=696041&r2=696042&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (original)
+++ incubator/tika/trunk/src/site/apt/formats.apt Tue Sep 16 13:33:31 2008
@@ -22,13 +22,39 @@
This page lists all the document formats supported by Apache Tika.
[bzip2 compression (application/x-bzip)]
- TODO
+ Tika uses an adapted version of the bzip2 parsing code from
+ {{{http://ant.apache.org/}Apache Ant}} to decompress bzip2 streams.
+ The bzip2 code is originally based on work by Keiron Liddle from
+ Aftex Software. Support for bzip2 compression was added in Tika 0.2.
+
+ The bzip2 parser decompresses the incoming stream and passes the
+ resulting stream to a configured delegate parser. If the
+ <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
+ that matches the common patterns <<<*.{tbz2,tbz}>>> or <<<*.{bz2,bz}>>>,
+ then the name is replaced with <<<*.tar>>> or <<<*>>> respectively before
+ passing the decompressed stream for further parsing.
+
+ Bzip2 compression is automatically detected based on a magic header
+ or glob patterns.
[Extensible Markup Language (application/xml)]
TODO
[gzip compression (application/x-gzip)]
- TODO
+ Tika uses Java's built-in gzip support to decompress gzip streams.
+ Support for gzip compression was added in Tika 0.2.
+
+ The gzip parser simply uses the
+ {{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
+ class to decompress the incoming stream. The resulting stream is
+ passed to a configured delegate parser. If the
+ <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
+ that matches the common patterns <<<*.tgz>>> or <<<*{.gz,-gz}>>>,
+ then the name is replaced with <<<*.tar>>> or <<<*>>> respectively before
+ passing the decompressed stream for further parsing.
+
+ Gzip compression is automatically detected based on a magic header
+ or glob patterns.
[HyperText Markup Language (text/html)]
TODO
@@ -61,6 +87,9 @@
upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
for the current status of this issue.
+ Microsoft Word documents are automatically detected based on a magic
+ header or a glob pattern.
+
For an example of parsing Microsoft Word files, see the
{{{xref-test/org/apache/tika/parser/microsoft/WordParserTest.html}WordParserTest}}
test case.
@@ -88,6 +117,9 @@
upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
for the current status of this issue.
+ Microsoft Excel spreadsheets are automatically detected based on a magic
+ header or a glob pattern.
+
For an example of parsing Microsoft Excel files, see the
{{{xref-test/org/apache/tika/parser/microsoft/ExcelParserTest.html}ExcelParserTest}}
test case.
@@ -112,6 +144,9 @@
upgrade. See {{{https://issues.apache.org/jira/browse/TIKA-152}TIKA-152}}
for the current status of this issue.
+ Microsoft PowerPoint presentations are automatically detected based on
+ a magic header or a glob pattern.
+
For an example of parsing Microsoft PowerPoint files, see the
{{{xref-test/org/apache/tika/parser/microsoft/PowerPointParserTest.html}PowerPointParserTest}}
test case.
@@ -131,6 +166,9 @@
Generic Microsoft Office document properties like title, author, and
keywords are returned as metadata properties.
+ Microsoft Visio diagrams are automatically detected based on a magic
+ header or a glob pattern.
+
[Microsoft Outlook (application/vnd.ms-outlook)]
Tika uses the {{{http://poi.apache.org/hsmf/}HSMF}} API in
{{{http://poi.apache.org/}Apache POI}} to parse OLE2-based Microsoft
@@ -140,6 +178,9 @@
the From, To, Cc, and Bcc addresses (formatted for display) along
with the body text of text/plain messages.
+ Microsoft Outlook messages are automatically detected based on a magic
+ header or a glob pattern.
+
For an example of parsing Microsoft Outlook files, see the
{{{xref-test/org/apache/tika/parser/microsoft/OutlookParserTest.html}OutlookParserTest}}
test case.
@@ -174,7 +215,9 @@
simplify encoding detection.
[Portable Document Format (application/pdf)]
- TODO
+ Tika uses the {{{http://www.pdfbox.org}PDFBox}} library to parse
+ Portable Document Format (PDF) documents. Support for PDF was added
+ in Tika 0.1.
[Rich Text Format (application/rtf)]
Tika uses Java's built-in Swing library to parse Rich Text Format (RTF)
@@ -186,7 +229,10 @@
Document metadata extraction is currently not supported.
[tar archive (application/x-tar)]
- TODO
+ Tika uses an adapted version of the tar parsing code from
+ {{{http://ant.apache.org/}Apache Ant}} to parse tar archives.
+ The tar code is originally based on work by Timothy Gerard Endres.
+ Support for tar archives was added in Tika 0.2.
[ZIP archive (application/zip)]
TODO
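
Illustration (not part of the commit): the bzip2 and gzip entries above
describe decompressing the incoming stream and rewriting the resource name
before handing both to a delegate parser. The following is a minimal sketch
of that idea, not Tika's actual implementation. The class name
DelegateNameSketch and the helpers rewriteName, gzip, and gunzip are
hypothetical, and the sketch merges the bzip2 and gzip suffix rules into one
table for brevity, whereas the documentation describes them per parser. Only
GZIPInputStream and GZIPOutputStream from the JDK are real APIs here.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class DelegateNameSketch {

    // Hypothetical helper: maps a compressed file name to the name that
    // would accompany the decompressed stream. Assumption: one shared rule
    // table for both compression formats.
    static String rewriteName(String name) {
        String[][] rules = {
            {".tbz2", ".tar"}, {".tbz", ".tar"}, {".tgz", ".tar"},
            {".bz2", ""}, {".bz", ""}, {".gz", ""}, {"-gz", ""},
        };
        for (String[] rule : rules) {
            if (name.endsWith(rule[0])) {
                String stem = name.substring(0, name.length() - rule[0].length());
                return stem + rule[1];
            }
        }
        return name; // no known compression suffix: leave the name alone
    }

    // Decompresses a gzip byte array with the JDK's GZIPInputStream.
    static byte[] gunzip(byte[] compressed) throws IOException {
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        }
    }

    // Compresses a byte array so the example has gzip input to work on.
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] original = "hello tika".getBytes(StandardCharsets.UTF_8);
        byte[] restored = gunzip(gzip(original));
        System.out.println(new String(restored, StandardCharsets.UTF_8));
        System.out.println(rewriteName("docs.tgz"));
        System.out.println(rewriteName("notes.txt.bz2"));
    }
}
```

In a real parser the decompressed InputStream would be passed straight to
the delegate rather than buffered into a byte array; the buffering here only
keeps the sketch self-contained.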