Posted to commits@tika.apache.org by ni...@apache.org on 2016/02/19 12:48:42 UTC
svn commit: r1731227 - in /tika/site: publish/1.12/formats.html src/site/apt/1.12/formats.apt
Author: nick
Date: Fri Feb 19 11:48:42 2016
New Revision: 1731227
URL: http://svn.apache.org/viewvc?rev=1731227&view=rev
Log:
Add something on new parsers for 1.12
Modified:
tika/site/publish/1.12/formats.html
tika/site/src/site/apt/1.12/formats.apt
Modified: tika/site/publish/1.12/formats.html
URL: http://svn.apache.org/viewvc/tika/site/publish/1.12/formats.html?rev=1731227&r1=1731226&r2=1731227&view=diff
==============================================================================
--- tika/site/publish/1.12/formats.html (original)
+++ tika/site/publish/1.12/formats.html Fri Feb 19 11:48:42 2016
@@ -144,7 +144,8 @@
<p>The <a href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> class uses the standard javax.swing.text.rtf feature to extract text content from Rich Text Format (RTF) documents.</p></div>
<div class="section">
<h3><a name="Compression_and_packaging_formats">Compression and packaging formats</a></h3>
-<p>Tika uses the <a class="externalLink" href="http://commons.apache.org/compress/">Commons Compress</a> library to support various compression and packaging formats. The <a href="./api/org/apache/tika/parser/pkg/CompressorParser.html">CompressorParser</a> class handles parsing of the top level compression formats, then <a href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> class and its subclasses parse the packaging formats and then pass the unpacked document streams to a second parsing stage using the parser instance specified in the parse context. Formats supported include Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.</p></div>
+<p>Tika uses the <a class="externalLink" href="http://commons.apache.org/compress/">Commons Compress</a> library to support various compression and packaging formats. The <a href="./api/org/apache/tika/parser/pkg/CompressorParser.html">CompressorParser</a> class handles parsing of the top level compression formats, then <a href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> class and its subclasses parse the packaging formats and then pass the unpacked document streams to a second parsing stage using the parser instance specified in the parse context. Formats supported include Tar, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.</p>
+<p>Additionally, the <a href="./api/org/apache/tika/parser/pkg/RARParser.html">RARParser</a> class supports the RAR archive format, which isn't supported by Commons Compress.</p></div>
<div class="section">
<h3><a name="Text_formats">Text formats</a></h3>
<p>Extracting text content from plain text files seems like a simple task until you start thinking of all the possible character encodings. The <a href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> class uses encoding detection code from the <a class="externalLink" href="http://site.icu-project.org/">ICU</a> project to automatically detect the character encoding of a text document.</p></div>
@@ -166,7 +167,8 @@
<h3><a name="Video_formats">Video formats</a></h3>
<p>Tika supports the Flash video format using a simple parsing algorithm implemented in the <a href="./api/org/apache/tika/parser/video/FLVParser">FLVParser</a> class.</p>
<p>The MP4 family of video formats (MP4, Quicktime, 3GPP etc) is supported by the <a href="./api/org/apache/tika/parser/mp4/MP4Parser">MP4Parser</a> class, which extracts metadata on the video, along with audio stream (if present).</p>
-<p>For the Ogg family of video formats, a limited amount of metadata is extracted by the <a href="./api/org/gagravarr/tika/OggParser.html">OggParser</a> class.</p></div>
+<p>For the Ogg family of video formats, a limited amount of metadata is extracted by the <a href="./api/org/gagravarr/tika/OggParser.html">OggParser</a> class.</p>
+<p>As an alternative, the <a href="./api/org/apache/tika/parser/pot/PooledTimeSeriesParser">PooledTimeSeriesParser</a> can be used (if the required tool is installed) to generate a numeric representation of the video suitable for similarity searches. More details on this approach, and setup instructions for the parser + tool, can be found on <a class="externalLink" href="https://wiki.apache.org/tika/PooledTimeSeriesParser">the Tika wiki page for the parser</a>.</p></div>
<div class="section">
<h3><a name="Java_class_files_and_archives">Java class files and archives</a></h3>
<p>The <a href="./api/org/apache/tika/parser/asm/ClassParser">ClassParser</a> class extracts class names and method signatures from Java class files, and the <a href="./api/org/apache/tika/parser/pkg/ZipParser.html">ZipParser</a> class supports also jar archives.</p></div>
@@ -383,7 +385,7 @@
</div>
<div id="footer">
<p>
- Copyright © 2015
+ Copyright © 2016
<a href="http://www.apache.org/">The Apache Software Foundation</a>.
Site powered by <a href="http://maven.apache.org/">Apache Maven</a>.
Search powered by
Modified: tika/site/src/site/apt/1.12/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.12/formats.apt?rev=1731227&r1=1731226&r2=1731227&view=diff
==============================================================================
--- tika/site/src/site/apt/1.12/formats.apt (original)
+++ tika/site/src/site/apt/1.12/formats.apt Fri Feb 19 11:48:42 2016
@@ -113,7 +113,11 @@ Supported Document Formats
class and its subclasses parse the packaging formats and then pass the
unpacked document streams to a second parsing stage using the parser
instance specified in the parse context. Formats supported include Tar,
- RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.
+ AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.
+
+ Additionally, the
+ {{{./api/org/apache/tika/parser/pkg/RARParser.html}RARParser}} class
+ supports the RAR archive format, which isn't supported by Commons Compress.
* {Text formats}
@@ -193,6 +197,14 @@ Supported Document Formats
extracted by the
{{{./api/org/gagravarr/tika/OggParser.html}OggParser}} class.
+ As an alternative, the
+ {{{./api/org/apache/tika/parser/pot/PooledTimeSeriesParser}PooledTimeSeriesParser}}
+ can be used (if the required tool is installed) to generate a numeric
+ representation of the video suitable for similarity searches. More details
+ on this approach, and setup instructions for the parser + tool, can be
+ found on {{{https://wiki.apache.org/tika/PooledTimeSeriesParser}the Tika
+ wiki page for the parser}}.
+
* {Java class files and archives}
The {{{./api/org/apache/tika/parser/asm/ClassParser}ClassParser}} class