Posted to commits@tika.apache.org by ni...@apache.org on 2016/02/19 12:48:42 UTC

svn commit: r1731227 - in /tika/site: publish/1.12/formats.html src/site/apt/1.12/formats.apt

Author: nick
Date: Fri Feb 19 11:48:42 2016
New Revision: 1731227

URL: http://svn.apache.org/viewvc?rev=1731227&view=rev
Log:
Add something on new parsers for 1.12

Modified:
    tika/site/publish/1.12/formats.html
    tika/site/src/site/apt/1.12/formats.apt

Modified: tika/site/publish/1.12/formats.html
URL: http://svn.apache.org/viewvc/tika/site/publish/1.12/formats.html?rev=1731227&r1=1731226&r2=1731227&view=diff
==============================================================================
--- tika/site/publish/1.12/formats.html (original)
+++ tika/site/publish/1.12/formats.html Fri Feb 19 11:48:42 2016
@@ -144,7 +144,8 @@
 <p>The <a href="./api/org/apache/tika/parser/rtf/RTFParser.html">RTFParser</a> class uses the standard javax.swing.text.rtf feature to extract text content from Rich Text Format (RTF) documents.</p></div>
 <div class="section">
 <h3><a name="Compression_and_packaging_formats">Compression and packaging formats</a></h3>
-<p>Tika uses the <a class="externalLink" href="http://commons.apache.org/compress/">Commons Compress</a> library to support various compression and packaging formats. The <a href="./api/org/apache/tika/parser/pkg/CompressorParser.html">CompressorParser</a> class handles parsing of the top level compression formats, then <a href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> class and its subclasses parse the packaging formats and then pass the unpacked document streams to a second parsing stage using the parser instance specified in the parse context. Formats supported include Tar, RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.</p></div>
+<p>Tika uses the <a class="externalLink" href="http://commons.apache.org/compress/">Commons Compress</a> library to support various compression and packaging formats. The <a href="./api/org/apache/tika/parser/pkg/CompressorParser.html">CompressorParser</a> class handles parsing of the top level compression formats, then <a href="./api/org/apache/tika/parser/pkg/PackageParser.html">PackageParser</a> class and its subclasses parse the packaging formats and then pass the unpacked document streams to a second parsing stage using the parser instance specified in the parse context. Formats supported include Tar, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.</p>
+<p>Additionally, the <a href="./api/org/apache/tika/parser/pkg/RARParser.html">RARParser</a> class supports the RAR archive format, which isn't supported by Commons Compress.</p></div>
 <div class="section">
 <h3><a name="Text_formats">Text formats</a></h3>
 <p>Extracting text content from plain text files seems like a simple task until you start thinking of all the possible character encodings. The <a href="./api/org/apache/tika/parser/txt/TXTParser.html">TXTParser</a> class uses encoding detection code from the <a class="externalLink" href="http://site.icu-project.org/">ICU</a> project to automatically detect the character encoding of a text document.</p></div>
@@ -166,7 +167,8 @@
 <h3><a name="Video_formats">Video formats</a></h3>
 <p>Tika supports the Flash video format using a simple parsing algorithm implemented in the <a href="./api/org/apache/tika/parser/video/FLVParser">FLVParser</a> class.</p>
 <p>The MP4 family of video formats (MP4, Quicktime, 3GPP etc) is supported by the <a href="./api/org/apache/tika/parser/mp4/MP4Parser">MP4Parser</a> class, which extracts metadata on the video, along with audio stream (if present).</p>
-<p>For the Ogg family of video formats, a limited amount of metadata is extracted by the <a href="./api/org/gagravarr/tika/OggParser.html">OggParser</a> class.</p></div>
+<p>For the Ogg family of video formats, a limited amount of metadata is extracted by the <a href="./api/org/gagravarr/tika/OggParser.html">OggParser</a> class.</p>
+<p>As an alternative, the <a href="./api/org/apache/tika/parser/pot/PooledTimeSeriesParser">PooledTimeSeriesParser</a> can be used (if the required tool is installed) to generate a numeric representation of the video suitable for similarity searches. More details on this approach, and setup instructions for the parser + tool, can be found on <a class="externalLink" href="https://wiki.apache.org/tika/PooledTimeSeriesParser">the Tika wiki page for the parser</a>.</p></div>
 <div class="section">
 <h3><a name="Java_class_files_and_archives">Java class files and archives</a></h3>
 <p>The <a href="./api/org/apache/tika/parser/asm/ClassParser">ClassParser</a> class extracts class names and method signatures from Java class files, and the <a href="./api/org/apache/tika/parser/pkg/ZipParser.html">ZipParser</a> class supports also jar archives.</p></div>
@@ -383,7 +385,7 @@
       </div>
       <div id="footer">
         <p>
-          Copyright &#169; 2015
+          Copyright &#169; 2016
           <a href="http://www.apache.org/">The Apache Software Foundation</a>.
           Site powered by <a href="http://maven.apache.org/">Apache Maven</a>. 
           Search powered by

Modified: tika/site/src/site/apt/1.12/formats.apt
URL: http://svn.apache.org/viewvc/tika/site/src/site/apt/1.12/formats.apt?rev=1731227&r1=1731226&r2=1731227&view=diff
==============================================================================
--- tika/site/src/site/apt/1.12/formats.apt (original)
+++ tika/site/src/site/apt/1.12/formats.apt Fri Feb 19 11:48:42 2016
@@ -113,7 +113,11 @@ Supported Document Formats
    class and its subclasses parse the packaging formats and then pass the 
    unpacked document streams to a second parsing stage using the parser 
    instance specified in the parse context. Formats supported include Tar, 
-   RAR, AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.
+   AR, CPIO, Zip, 7Zip, Gzip, BZip2, XZ and Pack200.
+
+   Additionally, the
+   {{{./api/org/apache/tika/parser/pkg/RARParser.html}RARParser}} class
+   supports the RAR archive format, which isn't supported by Commons Compress.
 
 * {Text formats}
 
@@ -193,6 +197,14 @@ Supported Document Formats
    extracted by the 
    {{{./api/org/gagravarr/tika/OggParser.html}OggParser}} class.
 
+   As an alternative, the
+   {{{./api/org/apache/tika/parser/pot/PooledTimeSeriesParser}PooledTimeSeriesParser}}
+   can be used (if the required tool is installed) to generate a numeric
+   representation of the video suitable for similarity searches. More details
+   on this approach, and setup instructions for the parser + tool, can be
+   found on {{{https://wiki.apache.org/tika/PooledTimeSeriesParser}the Tika
+   wiki page for the parser}}.
+
 * {Java class files and archives}
 
    The {{{./api/org/apache/tika/parser/asm/ClassParser}ClassParser}} class