Posted to dev@poi.apache.org by "Kalam, Venkata Krishna Chaitanya" <vk...@informatica.com.INVALID> on 2019/02/12 18:09:44 UTC

Event Based APIs for parsing docx,doc,pptx,ppt files

Hi team,
We are trying to read data from Office documents such as xlsx, xls, and docx, but we run into memory issues when reading large OOXML files (around 100 MB) with the POI APIs. For the xls/xlsx formats there are event-based APIs (the XSSF/HSSF event APIs) that solve the memory problem, but there are no event-based APIs for reading Word or PowerPoint files. We have to create an XWPF/HWPF document, which consumes a lot of memory. For example, building an XWPFDocument for a 45 MB DOCX file takes 12 GB of heap.

So, as with xlsx files, is there any plan to provide event-based APIs for the rest of the Office document formats?
And is there any workaround for reading the data with less memory consumption? Please let me know. Our use case is just to read the data.

Thanks
Chaitanya

Re: Event Based APIs for parsing docx,doc,pptx,ppt files

Posted by Tim Allison <ta...@apache.org>.
I've added SAX parsers for pptx and docx over on Apache Tika.  These
rely on POI for OPCPackage, a bunch of other classes and overall
design.

I've thought about moving that code into POI, but I haven't found the
time or need, and the code is my typical kludgy-mess...and I don't
want to pollute POI any more than I have.

Take a look over on Tika and see if those will work for you.  Let me
know what you think...

References:
https://wiki.apache.org/tika/MSOfficeParsers

https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXSLFPowerPointExtractorDecorator.java

https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/SXWPFWordExtractorDecorator.java


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: Event Based APIs for parsing docx,doc,pptx,ppt files

Posted by "pj.fanning" <fa...@yahoo.com>.
No one that I know of is actively working on a streaming API for docx or pptx.
Contributions to POI in these areas would be welcome.

One low level approach is to read docx/pptx files as zip files. If they are
password protected, you can use POI to first create a copy of the file with
the password protection removed (this supports streaming).

The zip files contain XML files that hold the content and the metadata (e.g.
style data). The XML can be parsed with SAX or StAX parsers. The XML formats
are described at https://en.wikipedia.org/wiki/Office_Open_XML
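
As a rough illustration of the zip-plus-SAX approach, here is a minimal
sketch that uses only JDK classes. It is not POI or Tika code. The class
name DocxTextSketch and the method forEachParagraph are made up for this
example. It assumes an unencrypted docx whose main body part is stored at
word/document.xml (the usual location, though the package relationships are
what formally decide this), and it collects only the text of w:t elements,
handing over one paragraph at a time so the whole document is never held in
memory.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.function.Consumer;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of POI: streams paragraph text out of a docx.
class DocxTextSketch {

    // Main WordprocessingML namespace used by w:p and w:t.
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Assumption: the main document part lives at this path inside the zip.
    static final String MAIN_PART = "word/document.xml";

    static void forEachParagraph(InputStream docx, Consumer<String> sink)
            throws IOException, SAXException, ParserConfigurationException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        // Refuse DOCTYPE declarations; OOXML parts do not need them.
        factory.setFeature(
                "http://apache.org/xml/features/disallow-doctype-decl", true);

        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!MAIN_PART.equals(entry.getName())) {
                    continue; // skip styles, media, relationships, etc.
                }
                // The zip stream is positioned on the entry, so the parser
                // reads the XML directly without unpacking it first.
                factory.newSAXParser().parse(zip, new ParagraphHandler(sink));
                break; // the parser may close the stream; nothing more to read
            }
        }
    }

    // Buffers the text of one paragraph and emits it when w:p closes.
    // Nested paragraphs (e.g. inside text boxes) are not handled specially.
    private static final class ParagraphHandler extends DefaultHandler {
        private final Consumer<String> sink;
        private final StringBuilder paragraph = new StringBuilder();
        private boolean inText;

        ParagraphHandler(Consumer<String> sink) {
            this.sink = sink;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if (W_NS.equals(uri) && "t".equals(localName)) {
                inText = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inText) {
                paragraph.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!W_NS.equals(uri)) {
                return;
            }
            if ("t".equals(localName)) {
                inText = false;
            } else if ("p".equals(localName)) {
                sink.accept(paragraph.toString());
                paragraph.setLength(0);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextSketch <file.docx>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            forEachParagraph(in, System.out::println);
        }
    }
}
```

The same pattern should carry over to pptx by matching the slide parts and
the DrawingML text elements instead, but that mapping is not shown here.
Memory use is bounded by the largest single paragraph rather than by the
file size.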

docx4j may be an option. As far as I know it does not support streaming but
it's possible it uses less memory when reading docx or pptx files.

For ppt and doc formats, I believe that the data formats don't lend
themselves to streaming the data.




