Posted to commits@tika.apache.org by Apache Wiki <wi...@apache.org> on 2019/04/03 22:16:38 UTC

[Tika Wiki] Update of "MSOfficeParsers" by TimothyAllison

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Tika Wiki" for change notification.

The "MSOfficeParsers" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/MSOfficeParsers?action=diff&rev1=7&rev2=8

  
  == Beta SAX Parsers for .docx and .pptx ==
  
- As of Tika 1.15, there are experimental/beta SAX parsers for .docx files.  On very large files (e.g. "War and Peace"), this parser appears to be 4x faster and require far less memory than our traditional DOM based parsers.  For smaller files, the gain is not nearly as great.  For the 386MB pptx submitted on TIKA-2201, it would have taken ~60GB to load the file in memory.
+ As of Tika 1.15, there are experimental/beta SAX parsers for .docx files.  On very large files (e.g. "War and Peace"), this parser appears to be 4x faster and require far less memory than our traditional DOM based parsers.  For smaller files, the gain is not nearly as great.  For the 386MB pptx submitted on TIKA-2201, it would have taken ~60GB to load the file in memory.  See also, the DOCX file submitted on TIKA-2847, which is only 3.6MB, but decompresses to ~100MB.
  
  These parsers are still in their early stages and don't have all of the features of the DOM parsers.  However, the .docx parser does offer parameterization to include or exclude deleted text.
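
To illustrate the idea of a streaming parser with an include/exclude switch for deleted text, here is a minimal sketch that uses only the JDK's SAX API. It is not Tika's implementation: the class name `DocxSaxSketch`, the `extract` method, and the `SAMPLE` string are hypothetical names introduced for this example. It assumes the input is the WordprocessingML body of a .docx (the `word/document.xml` entry of the zip), where ordinary run text sits in `w:t` elements and tracked-deletion text sits in `w:delText` elements. Because SAX delivers events as the stream is read, the document tree is never built in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DocxSaxSketch {

    // Tiny hand-written WordprocessingML fragment with one tracked deletion.
    static final String SAMPLE =
        "<w:document xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">"
        + "<w:body><w:p>"
        + "<w:r><w:t>Hello </w:t></w:r>"
        + "<w:del><w:r><w:delText>old </w:delText></w:r></w:del>"
        + "<w:r><w:t>world</w:t></w:r>"
        + "</w:p></w:body></w:document>";

    /** Collects run text as SAX events arrive; deleted text is optional. */
    static final class TextHandler extends DefaultHandler {
        private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";
        private final boolean includeDeleted;
        private final StringBuilder out = new StringBuilder();
        private boolean capture = false;

        TextHandler(boolean includeDeleted) {
            this.includeDeleted = includeDeleted;
        }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(local)) {
                capture = true;                // ordinary run text
            } else if ("delText".equals(local)) {
                capture = includeDeleted;      // tracked-deletion text
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(local) || "delText".equals(local)) {
                capture = false;
            } else if ("p".equals(local)) {
                out.append('\n');              // one line per paragraph
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (capture) {
                out.append(ch, start, length);
            }
        }

        String text() {
            return out.toString();
        }
    }

    /** Streams the XML through a namespace-aware SAX parser and returns the text. */
    static String extract(InputStream xml, boolean includeDeleted) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        TextHandler handler = new TextHandler(includeDeleted);
        factory.newSAXParser().parse(xml, handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = SAMPLE.getBytes(StandardCharsets.UTF_8);
        System.out.print(extract(new ByteArrayInputStream(bytes), false));
        System.out.print(extract(new ByteArrayInputStream(bytes), true));
    }
}
```

In this sketch the only state kept is a boolean flag and the output buffer, which is the property that lets a SAX approach cope with files that decompress to far more than their size on disk. A real parser would also need to handle tables, headers, footnotes, and the other parts of the package.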