Posted to dev@poi.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/04/04 12:29:02 UTC

bean-free ooxml streaming readers?

Thank you, Javen.

On Tika, we now have mostly bean-free SAX-based parsers for docx and pptx.

Is there any interest in moving those into POI and potentially working towards creating a new read-only module that does not depend on ooxml-schemas?

Not for 3.16, obviously...

Cheers,

               Tim

-----Original Message-----
From: Javen O'Neal [mailto:onealj@apache.org] 
Sent: Tuesday, April 4, 2017 2:52 AM
To: POI Developers List <de...@poi.apache.org>
Subject: Re: POI 3.16 Final?

+1

ThreadLocal-related bugs:
https://bz.apache.org/bugzilla/buglist.cgi?bug_status=__all__&content=Threadlocal&list_id=158556&order=Importance&product=POI&query_format=specific

On Apr 3, 2017 8:15 AM, "Dominik Stadler" <do...@gmx.at> wrote:

> Hi,
>
> soonish sounds fine to me, I can start a first regression run tomorrow 
> or Wednesday to get results before we start the release process.
>
> On ThreadLocals: XMLBeans is dormant, so unfortunately not much help is 
> to be expected from there. I started some early work on 
> https://bz.apache.org/bugzilla/show_bug.cgi?id=59268 some time ago, but 
> it was very unwieldy, i.e. for some reason it was hard to even identify 
> the exact version used for previous releases or build the latest 
> version cleanly.
>
> Dominik.
>
> On Mon, Apr 3, 2017 at 3:14 PM, Allison, Timothy B. 
> <ta...@mitre.org>
> wrote:
>
> > Is there anything we can do about ThreadLocal leaks in POI bug 
> > 55149/XMLBEANS-502/TIKA-1784?
> >
> > -----Original Message-----
> > From: Allison, Timothy B. [mailto:tallison@mitre.org]
> > Sent: Monday, April 3, 2017 9:06 AM
> > To: POI Developers List <de...@poi.apache.org>
> > Subject: RE: POI 3.16 Final?
> >
> > +1 for next couple days-ish.
> >
> > I'd like to finish or abandon hope on 50955.  I'll be working on 
> > that this morning.
> >
> > -----Original Message-----
> > From: Andreas Beeker [mailto:kiwiwings@apache.org]
> > Sent: Saturday, April 1, 2017 6:46 PM
> > To: POI Developers List <de...@poi.apache.org>
> > Subject: POI 3.16 Final?
> >
> > Hi,
> >
> > how about pushing the 3.16 final out soon?
> >
> > I have quite a few changes for HPSF, which I don't want to commit to 
> > the final.
> >
> > Who will be the release manager? ... I'll be the fallback, if 
> > everyone is busy ...
> >
> > Andi

RE: bean-free ooxml streaming readers?

Posted by "Allison, Timothy B." <ta...@mitre.org>.
> Since it would be read-only, would it just be another option, instead of a full replacement?

Yes, think of it like XSSF's eventusermodel.  We define an interface for what a user will have to react to, like XSSFSheetXMLHandler's SheetContentsHandler, and we take care of the rest.  You can see the current examples for docx [1] and pptx [2] in Tika.
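
A minimal sketch of that division of labor, using only the JDK's SAX parser. The names BodyContentsHandler, BodyDispatcher and EventModelSketch are hypothetical and made up for illustration; they are not existing POI or Tika types, and the real extractors in [1] and [2] handle far more than paragraphs and runs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical callback interface: the only thing a user would implement.
interface BodyContentsHandler {
    void startParagraph();
    void run(String text);
    void endParagraph();
}

// Library side: turns raw SAX events into the callbacks above.
class BodyDispatcher extends DefaultHandler {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private final BodyContentsHandler handler;
    private final StringBuilder text = new StringBuilder();
    private boolean inText;

    BodyDispatcher(BodyContentsHandler handler) {
        this.handler = handler;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!W_NS.equals(uri)) {
            return;
        }
        if ("p".equals(localName)) {
            handler.startParagraph();
        } else if ("t".equals(localName)) {
            inText = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so buffer it.
        if (inText) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!W_NS.equals(uri)) {
            return;
        }
        if ("t".equals(localName)) {
            inText = false;
            handler.run(text.toString());
        } else if ("p".equals(localName)) {
            handler.endParagraph();
        }
    }

    static void parse(String xml, BodyContentsHandler handler) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser()
                   .parse(new InputSource(new StringReader(xml)), new BodyDispatcher(handler));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document XML", e);
        }
    }
}

class EventModelSketch {
    // User side: record each callback as a string, in arrival order.
    static List<String> collectEvents(String xml) {
        List<String> events = new ArrayList<>();
        BodyDispatcher.parse(xml, new BodyContentsHandler() {
            @Override public void startParagraph() { events.add("startParagraph"); }
            @Override public void run(String text) { events.add("run:" + text); }
            @Override public void endParagraph() { events.add("endParagraph"); }
        });
        return events;
    }
}
```

The user never touches SAX or any schema-generated bean; everything arrives through the callback interface, which is the same shape as SheetContentsHandler on the XSSF side.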

> Would the data model need to be more fully fleshed out to support all the corners of the OOXML spec not currently represented?

Not that I'm aware of...but...ymmv.  In some cases, reading for specific elements like "w:t" is actually more robust than traversing the DOM and requiring known structural relationships.  Bug 54849 requires us to know to look for SDT at the block level of the document [3].  We wouldn't have hit that if all we cared about were "w:t" or even "sdt" wherever they occurred.  The same is true, at a different structural level, of the glossary document.  There were a handful of other examples that I stumbled upon while working on the SAX parsers in Tika.
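
To illustrate the "w:t wherever it occurs" point, here is a rough sketch, again JDK-only. AnyLevelTextCollector is a hypothetical name, and the test input is assumed to be a hand-written fragment rather than a real document.xml. Because the handler keys only on the element name, text nested under a block-level SDT is collected with no special case; a real extractor would also need to deal with tabs, breaks and the like.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: gathers the content of every w:t, at any depth.
class AnyLevelTextCollector extends DefaultHandler {
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    private final List<String> texts = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inText;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // No check on the parent element: p, sdt, sdtContent etc. are all ignored.
        if (W_NS.equals(uri) && "t".equals(localName)) {
            inText = true;
            buffer.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inText) {
            buffer.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (W_NS.equals(uri) && "t".equals(localName)) {
            inText = false;
            texts.add(buffer.toString());
        }
    }

    // Parses the given XML string and returns the w:t contents in document order.
    static List<String> collect(String xml) {
        AnyLevelTextCollector collector = new AnyLevelTextCollector();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), collector);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document XML", e);
        }
        return collector.texts;
    }
}
```

A DOM-style walk that only visits paragraphs directly under the body would miss the second paragraph in an input where it sits inside w:sdt/w:sdtContent; the collector above does not care where the w:t lives.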

> Is there anything at all that could help with the write side without the overhead of XMLBeans?

Not that I can think of...that'll be quite some work. 


[1] https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xwpf/XWPFEventBasedWordExtractor.java

[2] https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/ooxml/xslf/XSLFEventBasedPowerPointExtractor.java 

[3] https://bz.apache.org/bugzilla/show_bug.cgi?id=54849


Re: bean-free ooxml streaming readers?

Posted by Greg Woolsey <gr...@gmail.com>.
Late to the party as I sift through my Spring Break email backlog.

Since it would be read-only, would it just be another option, instead of a
full replacement?  Would the data model need to be more fully fleshed out
to support all the corners of the OOXML spec not currently represented?

Is there anything at all that could help with the write side without the
overhead of XMLBeans?

On Tue, Apr 4, 2017 at 8:38 AM Javen O'Neal <ja...@gmail.com> wrote:

> Absolutely!
>
> Since XMLBeans is collecting cobwebs in the attic, we've been looking for a
> replacement, so long as it doesn't grow the POI codebase too much.
>
> On Apr 4, 2017 5:29 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:
>
> Thank you, Javen.
>
> On Tika, we now have mostly bean-free SAX-based parsers for docx and pptx.
>
> Is there any interest in moving those into POI and potentially working
> towards creating a new read-only module that does not depend on
> ooxml-schemas?
>
> Not for 3.16, obviously...
>
> Cheers,
>
>                Tim
>

Re: bean-free ooxml streaming readers?

Posted by Javen O'Neal <ja...@gmail.com>.
Absolutely!

Since XMLBeans is collecting cobwebs in the attic, we've been looking for a
replacement, so long as it doesn't grow the POI codebase too much.

On Apr 4, 2017 5:29 AM, "Allison, Timothy B." <ta...@mitre.org> wrote:

Thank you, Javen.

On Tika, we now have mostly bean-free SAX-based parsers for docx and pptx.

Is there any interest in moving those into POI and potentially working
towards creating a new read-only module that does not depend on
ooxml-schemas?

Not for 3.16, obviously...

Cheers,

               Tim