Posted to commits@tika.apache.org by ju...@apache.org on 2008/09/16 23:41:08 UTC
svn commit: r696078 - /incubator/tika/trunk/src/site/apt/formats.apt
Author: jukka
Date: Tue Sep 16 14:41:08 2008
New Revision: 696078
URL: http://svn.apache.org/viewvc?rev=696078&view=rev
Log:
TIKA-157: List all the document formats supported by Tika
A pretty comprehensive section on general purpose compression.
Modified:
incubator/tika/trunk/src/site/apt/formats.apt
Modified: incubator/tika/trunk/src/site/apt/formats.apt
URL: http://svn.apache.org/viewvc/incubator/tika/trunk/src/site/apt/formats.apt?rev=696078&r1=696077&r2=696078&view=diff
==============================================================================
--- incubator/tika/trunk/src/site/apt/formats.apt (original)
+++ incubator/tika/trunk/src/site/apt/formats.apt Tue Sep 16 14:41:08 2008
@@ -21,40 +21,53 @@
This page lists all the document formats supported by Apache Tika.
- [bzip2 compression (application/x-bzip)]
- Tika uses an adapted version of the bzip2 parsing code from
- {{{http://ant.apache.org/}Apache Ant}} to decompress bzip2 streams.
- The bzip2 code is originally based on work by Keiron Liddle from
- Aftex Software. Support for bzip2 compression was added in Tika 0.2.
-
- The bzip2 parser decompresses the incoming stream and passes the
- resulting stream to a configured delegate parser. If the
- <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
- that matches the common patterns <<<*.{tbz2,tbz}>>> or <<<*.{bz2,bz}>>>,
- then name is replaced with <<<*.tar>>> or <<<*>>> respectively before
- passing the decompressed stream for further parsing.
-
- Bzip2 compression is automatically detected based on a magic header
- or glob patterns.
+* Compression formats
- [Extensible Markup Language (application/xml)]
- TODO
+ General purpose compression formats are used to reduce the size of
+ any kind of document. Tika uses a parsing pipeline to support general
+ purpose compression: in the first stage the compressed stream is
+ decompressed, and the resulting stream is passed on to a second parsing
+ stage where it is processed as if the document had never been
+ compressed.
+
+ Tika contains magic numbers and glob patterns for auto-detecting all
+ supported compression formats. The glob patterns of compression formats
+ are also used to determine the name of the original uncompressed document.
+ If a client application has supplied a <<<RESOURCE_NAME_KEY>>> metadata
+ property that matches such a glob pattern, then the first (decompressing)
+ parsing stage will replace the <<<RESOURCE_NAME_KEY>>> metadata property
+ with the deduced original document name before passing control to the
+ second parsing stage.
+
+ Note that apart from the special handling of the <<<RESOURCE_NAME_KEY>>>
+ property, no document metadata is passed to or from the second parsing
+ stage. Only the text content extracted by the second stage parser is
+ returned to the client application.
[gzip compression (application/x-gzip)]
- Tika uses Java's built-in gzip support to decompress gzip streams.
- Support for gzip compression was added in Tika 0.2.
-
- The gzip parser simply uses the
+ {{{http://en.wikipedia.org/wiki/Gzip}Gzip}} support was added in
+ Tika version 0.2 and is based on the
{{{http://java.sun.com/j2se/1.5.0/docs/api/java/util/zip/GZIPInputStream.html}GZIPInputStream}}
- class to decompress the incoming stream. The resulting stream is
- passed to a configured delegate parser. If the
- <<<RESOURCE_NAME_KEY>>> metadata property is set to a file name
- that matches the common patterns <<<*.tgz2>>> or <<<*{.gz,-gz}>>>,
- then name is replaced with <<<*.tar>>> or <<<*>>> respectively before
- passing the decompressed stream for further parsing.
+ class in the Java 5 class library.
+
+ The known gzip glob patterns are <<<*.tgz>>>, <<<*.gz>>> and <<<*-gz>>>,
+ and they will respectively be replaced with <<<*.tar>>>, <<<*>>> and
+ <<<*>>> as described above.
- Gzip compression is automatically detected based on a magic header
- or glob patterns.
+ [bzip2 compression (application/x-bzip)]
+ {{{http://en.wikipedia.org/wiki/Bzip2}Bzip2}} support was added in
+ Tika version 0.2 and is based on bzip2 parsing code from
+ {{{http://ant.apache.org/}Apache Ant}}, which in turn was originally
+ based on work by Keiron Liddle from Aftex Software.
+
+ The known bzip2 glob patterns are <<<*.tbz>>>, <<<*.tbz2>>>, <<<*.bz>>>
+ and <<<*.bz2>>>, and they will respectively be replaced with <<<*.tar>>>,
+ <<<*.tar>>>, <<<*>>> and <<<*>>> as described above.
+
+* Other supported formats
+
+ [Extensible Markup Language (application/xml)]
+ TODO
[HyperText Markup Language (text/html)]
TODO
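
The renaming rules stated in the gzip and bzip2 entries of this diff can be summarized as a small suffix lookup. The Java sketch below is illustrative only and is not Tika's implementation: the class name CompressedNameSketch, the method originalName, the case-sensitive suffix matching, and the choice to leave a bare suffix such as ".gz" unchanged are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: maps a compressed resource name
// to the deduced name of the original uncompressed document.
public class CompressedNameSketch {

    // Suffix replacements taken from the glob patterns listed in the diff:
    // *.tgz, *.tbz and *.tbz2 become *.tar; *.gz, *-gz, *.bz and *.bz2 become *.
    private static final Map<String, String> SUFFIXES = new LinkedHashMap<>();
    static {
        SUFFIXES.put(".tgz", ".tar");
        SUFFIXES.put(".tbz2", ".tar");
        SUFFIXES.put(".tbz", ".tar");
        SUFFIXES.put(".gz", "");
        SUFFIXES.put("-gz", "");
        SUFFIXES.put(".bz2", "");
        SUFFIXES.put(".bz", "");
    }

    // Returns the deduced original name, or the input unchanged when no
    // known suffix matches. Assumption: a name consisting only of a suffix
    // (for example ".gz") is left as-is rather than reduced to an empty string.
    public static String originalName(String name) {
        for (Map.Entry<String, String> entry : SUFFIXES.entrySet()) {
            String suffix = entry.getKey();
            if (name.length() > suffix.length() && name.endsWith(suffix)) {
                String base = name.substring(0, name.length() - suffix.length());
                return base + entry.getValue();
            }
        }
        return name;
    }
}
```

Under these assumptions, a first-stage parser would call originalName on the supplied resource name and hand the result, together with the decompressed stream, to the second parsing stage.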