Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2017/04/05 13:13:48 UTC

[Tika Wiki] Update of "MSOfficeParsers" by TimothyAllison

The "MSOfficeParsers" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/MSOfficeParsers?action=diff&rev1=3&rev2=4

  = Tika's MSOffice Parsers (Apache POI) =
  
- == Experimental SAX Parser for .docx ==
+ == Experimental SAX Parser for .docx and .pptx ==
  
- As of Tika 1.15, there is an experimental SAX parser for .docx files.  On very large files (e.g. "War and Peace"), this parser appears to be 4x faster and require far less memory than our traditional DOM based parser.  For smaller files, the gain is not nearly as great, but it is still faster.  This parser is still in its early stages and doesn't have all of the features of the DOM parser.  However, it does offer parameterization to include or exclude deleted text.
+ As of Tika 1.15, there are experimental SAX parsers for .docx and .pptx files.  On very large .docx files (e.g. "War and Peace"), the SAX parser appears to be 4x faster and to require far less memory than our traditional DOM-based parser.  For smaller files, the gain is not nearly as great.  For the 386MB pptx submitted on TIKA-2201, it would have taken ~60GB to load the file in memory with the DOM-based parser.
  
+ These parsers are still in their early stages and don't have all of the features of the DOM parsers.  However, the .docx parser does offer parameterization to include or exclude deleted text.
+ 
- To select it programmatically, set `setUseSAXDocxExtractor` to `true` on an OfficeParserConfig and put that in the ParseContext: `context.set(OfficeParserConfig.class, officeParserConfig);`.
+ To select one programmatically, call `setUseSAXDocxExtractor` or `setUseSAXPptxExtractor` with `true` on an OfficeParserConfig and put that config in the ParseContext: `context.set(OfficeParserConfig.class, officeParserConfig);`.
  
  To set it via the config file, try:
  
@@ -17, +19 @@

          <parser class="org.apache.tika.parser.microsoft.ooxml.OOXMLParser">
              <params>
                  <param name="useSAXDocxExtractor" type="bool">true</param>
+                 <param name="useSAXPptxExtractor" type="bool">true</param>
              </params>
          </parser>
      </parsers>
  </properties>
  }}}
  
- See [[https://issues.apache.org/jira/browse/TIKA-1321|TIKA-1321]] for the parser and [[https://issues.apache.org/jira/browse/TIKA-2180|TIKA-2180]] for some symptoms that the current DOM parser might be slowing you down.
+ See [[https://issues.apache.org/jira/browse/TIKA-1321|TIKA-1321]] for the parser, and [[https://issues.apache.org/jira/browse/TIKA-2180|TIKA-2180]] and [[https://issues.apache.org/jira/browse/TIKA-2201|TIKA-2201]] for some symptoms that the current DOM parser might be slowing you down.
-  
+ 
+ 
+ 
  == How to build Tika with POI's trunk ==
  
  You'll need to have the following build tools installed: Ant, Forrest and Maven.