Posted to dev@poi.apache.org by Andreas Beeker <ki...@apache.org> on 2018/08/07 20:54:26 UTC

Re: upgrading to 4.0.0

Hi Tim,

On 7/31/18 9:49 PM, Tim Allison wrote:
>   I'm trying to upgrade Tika to 4.0.0-SNAPSHOT.
>
> 2) To confirm OLEShape has become HSLFObjectShape?

You are correct.

Can I help you with the upgrade?

Andi



Re: upgrading to 4.0.0

Posted by Tim Allison <ta...@apache.org>.
I finally had a chance to look at *-reports-b.tgz

1) I found a few trivial-looking bugs in the VBAMacroReader which
appear to explain why we have fewer attachments in some files.
Overall, though, we have 6,182 more attachments than we did in 3.17.
2) The new exceptions are nearly all EOFs from files that are likely
truncated. The one outlier is a read limit we're running into with
ExOleObjStg objects, so I'll bump that up again.
3) There appear to be a few cases where boilerplate from pptx files is
coming through, but the number of times that is happening now is tiny.
See, e.g. govdocs1/627/627647.pptx which now has these words unique to
4.0.0's extraction vs 3.17's:
level: 4 | click: 1 | edit: 1 | fifth: 1 | fourth: 1 | master: 1 |
second: 1 | styles: 1 | text: 1 | third: 1
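For readers unfamiliar with this kind of report line, here is a minimal sketch of how a "words unique to one extraction" list can be computed. It is not the code behind the regression reports: the class name UniqueTokens, its method names, and the simple letters-only tokenizer are assumptions made for illustration, and the sample strings are invented. The tokens listed above do look consistent with PowerPoint's default master placeholder text ("Click to edit Master text styles", "Second level", and so on).

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class UniqueTokens {

    /** Counts lowercase alphabetic tokens in the extracted text (hypothetical tokenizer). */
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the tokens (with their counts) that occur in newText but never in oldText. */
    public static Map<String, Integer> uniqueTo(String newText, String oldText) {
        Map<String, Integer> result = count(newText);
        // Drop every token that the old extraction also produced.
        result.keySet().removeAll(count(oldText).keySet());
        return result;
    }

    public static void main(String[] args) {
        // Invented sample text, not taken from govdocs1/627/627647.pptx.
        String oldText = "Quarterly results";
        String newText = "Quarterly results Click to edit Master text styles Second level";
        System.out.println(uniqueTo(newText, oldText));
    }
}
```

A real comparison would likely also filter stop words, which this sketch does not do.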

Overall, I'm +1 on moving forward with the release as is, with an
updated xmlbeans, and fixing 1) and 3) in 4.0.1.

Did anyone else find any blockers or other issues?
On Fri, Aug 10, 2018 at 2:29 PM Tim Allison <ta...@apache.org> wrote:
>
> Updated reports are here: http://162.242.228.174/reports/poi-4.0.0-reports-b.tgz
>
> Will turn to them shortly...
> On Thu, Aug 9, 2018 at 11:54 AM Tim Allison <ta...@apache.org> wrote:
> >
> > All,
> >   I fixed the three areas for improvement that I found in my first run
> > of regression tests.  I'm going to kick off another run.  Is there
> > anything else we need to do to prep for the release of 4.0.0?
> >   Any idea when we might start the release process if there aren't any
> > other surprises in the regression tests?
> >   Thank you!
> >
> >      Cheers,
> >
> >                 Tim
> > On Wed, Aug 8, 2018 at 10:24 AM Tim Allison <ta...@apache.org> wrote:
> > >
> > > The reports from 3.17 vs 4.0.0-SNAPSHOT are here:
> > > http://162.242.228.174/reports/poi-4.0.0_reports.tar.gz
> > >
> > > Aside from the two issues I've already identified (stackoverflow and
> > > small regression on boilerplate/template identification in ppt), it
> > > looks like more files are being identified as tika-ooxml than .docx or
> > > pptx.  This may be a Tika-level issue, but I want to look into that.
> > >
> > > If anyone notices anything else, please let me know!
> > > On Wed, Aug 8, 2018 at 7:25 AM Tim Allison <ta...@apache.org> wrote:
> > > >
> > > > Hi Andi,
> > > >   I think I'm mostly good.  If you could take a look at:
> > > > https://bz.apache.org/bugzilla/show_bug.cgi?id=62592.  The thmx file
> > > > doesn't open in pptx and may be malformed, but we need to prevent the
> > > > StackOverflowError...infinite recursion.
> > > >   I don't like the patch because it requires the caching of
> > > > ContentTypeEntry after its initial creation, which we're doing
> > > > currently, but it doesn't feel right to require that.
> > > >   So, if you have a better solution, please help! :)
> > > >
> > > >   Thank you!
> > > >
> > > >             Best,
> > > >
> > > >                    Tim
> > > > On Tue, Aug 7, 2018 at 4:54 PM Andreas Beeker <ki...@apache.org> wrote:
> > > > >
> > > > > Hi Tim,
> > > > >
> > > > > On 7/31/18 9:49 PM, Tim Allison wrote:
> > > > > >   I'm trying to upgrade Tika to 4.0.0-SNAPSHOT.
> > > > > >
> > > > > > 2) To confirm OLEShape has become HSLFObjectShape?
> > > > >
> > > > > You are correct.
> > > > >
> > > > > Can I help you with the upgrade?
> > > > >
> > > > > Andi
> > > > >
> > > > >

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@poi.apache.org
For additional commands, e-mail: dev-help@poi.apache.org


Re: upgrading to 4.0.0

Posted by Tim Allison <ta...@apache.org>.
Updated reports are here: http://162.242.228.174/reports/poi-4.0.0-reports-b.tgz

Will turn to them shortly...



Re: upgrading to 4.0.0

Posted by "pj.fanning" <fa...@yahoo.com>.
Hi Tim,

I don't think there is anything delaying a 4.0.0 release other than getting
the regression suites tested and fixing any bugs that are found.






Re: upgrading to 4.0.0

Posted by Tim Allison <ta...@apache.org>.
All,
  I fixed the three areas for improvement that I found in my first run
of regression tests.  I'm going to kick off another run.  Is there
anything else we need to do to prep for the release of 4.0.0?
  Any idea when we might start the release process if there aren't any
other surprises in the regression tests?
  Thank you!

     Cheers,

                Tim



Re: upgrading to 4.0.0

Posted by Tim Allison <ta...@apache.org>.
The reports from 3.17 vs 4.0.0-SNAPSHOT are here:
http://162.242.228.174/reports/poi-4.0.0_reports.tar.gz

Aside from the two issues I've already identified (stackoverflow and
small regression on boilerplate/template identification in ppt), it
looks like more files are being identified as tika-ooxml than .docx or
pptx.  This may be a Tika-level issue, but I want to look into that.

If anyone notices anything else, please let me know!



Re: upgrading to 4.0.0

Posted by Tim Allison <ta...@apache.org>.
Hi Andi,
  I think I'm mostly good.  If you could take a look at:
https://bz.apache.org/bugzilla/show_bug.cgi?id=62592.  The thmx file
doesn't open in pptx and may be malformed, but we need to prevent the
StackOverflowError...infinite recursion.
  I don't like the patch because it requires the caching of
ContentTypeEntry after its initial creation, which we're doing
currently, but it doesn't feel right to require that.
  So, if you have a better solution, please help! :)

  Thank you!

            Best,

                   Tim
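
The thread does not show the patch attached to bug 62592, so the following is only a generic sketch of one way to turn infinite recursion on malformed input into an ordinary exception, without relying on cached state. It is not POI's code: the class CycleGuard, the resolve method, and the string-to-string reference map are hypothetical stand-ins for whatever structure recurses in the real parser.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CycleGuard {

    /**
     * Follows a chain of references until it reaches a key that maps to nothing,
     * and throws instead of recursing forever if the chain loops back on itself.
     */
    public static String resolve(Map<String, String> refs, String start) {
        return resolve(refs, start, new HashSet<>());
    }

    private static String resolve(Map<String, String> refs, String key, Set<String> seen) {
        // Set.add returns false when the key was already visited: that is a cycle.
        if (!seen.add(key)) {
            throw new IllegalStateException("Cyclic reference detected at: " + key);
        }
        String next = refs.get(key);
        return next == null ? key : resolve(refs, next, seen);
    }

    public static void main(String[] args) {
        Map<String, String> refs = new HashMap<>();
        refs.put("a", "b");
        refs.put("b", "a"); // hypothetical malformed input: a -> b -> a
        try {
            resolve(refs, "a");
        } catch (IllegalStateException e) {
            System.out.println("Rejected malformed input: " + e.getMessage());
        }
    }
}
```

The visited set is passed down the call chain rather than stored on an object, so the guard does not depend on any entry being cached after its creation.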
