Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2021/04/06 15:22:03 UTC

3.0.0-RC1 regression tests?

Hi All,

  Would it be useful for me to run regression tests comparing 2.x with
3.0.0-RC1 now or should I wait?  Or, has someone already done this?

  See https://issues.apache.org/jira/browse/TIKA-3347 for integration
with Tika.  Many thanks!

      Cheers,

           Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 3.0.0-RC1 regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
Hello Tim,

Could you please start another "B" batch + eval? I think we've fixed
most of the issues, maybe all.

Thanks

Tilman

On 09.04.2021 at 20:11, Tim Allison wrote:
> Apologies for my delay...
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-3.x-snapshot-reports.tgz
>
> I added two new reports: new_catastrophic_exceptions_in_b and
> fixed_catastrophic_exceptions_in_b.  The former shows which files had
> a missing or 0-byte extract in B but not in A.  The latter shows the
> opposite.  We can get missing or 0-byte extracts when the app crashes
> (timeout, OOM, or other fatal crash).  Because the run is
> multithreaded, every file being parsed at the time of a
> catastrophic event will have a 0-byte or missing extract, so the
> reports likely include some files that are actually OK.
>
> I ran the comparison before Tilman's fix this morning for the
> infinite loop.  Note that the loop surfaced as a regular IOException,
> because TikaInputStream detected it via too many EOFs, so it did not
> cause catastrophic problems.
>
> Let me know if you have questions.  I haven't looked in great detail yet...
>
> There's every chance that I need to make some more changes on the Tika side. :D
>
> Cheers and happy 3.x!
>
> Best,
>
>        Tim


Re: 3.0.0-RC1 regression tests?

Posted by Tim Allison <ta...@apache.org>.
Apologies for my delay...

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-3.x-snapshot-reports.tgz

I added two new reports: new_catastrophic_exceptions_in_b and
fixed_catastrophic_exceptions_in_b.  The former shows which files had
a missing or 0-byte extract in B but not in A.  The latter shows the
opposite.  We can get missing or 0-byte extracts when the app crashes
(timeout, OOM, or other fatal crash).  Because the run is
multithreaded, every file being parsed at the time of a
catastrophic event will have a 0-byte or missing extract, so the
reports likely include some files that are actually OK.
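
The A/B logic described above can be sketched in a few lines of Python.
This is only an illustration, not the code that produced the reports:
the function names, the idea of one extract file per input, and the
assumption that A and B use the same relative paths are all
hypothetical.

```python
from pathlib import Path


def bad_extracts(root, rel_paths):
    """Return the relative paths whose extract under root is missing or 0 bytes."""
    bad = set()
    for rel in rel_paths:
        extract = Path(root) / rel
        if not extract.is_file() or extract.stat().st_size == 0:
            bad.add(rel)
    return bad


def compare_catastrophic(dir_a, dir_b, rel_paths):
    """Compare two extract directories (hypothetical layout: same paths in A and B)."""
    bad_a = bad_extracts(dir_a, rel_paths)
    bad_b = bad_extracts(dir_b, rel_paths)
    return {
        # missing or 0-byte in B, but fine in A
        "new_catastrophic_exceptions_in_b": sorted(bad_b - bad_a),
        # missing or 0-byte in A, but fine in B
        "fixed_catastrophic_exceptions_in_b": sorted(bad_a - bad_b),
    }
```

Note that a set difference like this cannot tell a file that caused a
crash from a file that merely happened to be in flight on another
thread when the crash occurred, which is why each listed file still
needs a manual check.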

I ran the comparison before Tilman's fix this morning for the
infinite loop.  Note that the loop surfaced as a regular IOException,
because TikaInputStream detected it via too many EOFs, so it did not
cause catastrophic problems.
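
To illustrate the general idea of turning "too many EOFs" into an
ordinary I/O error, here is a minimal Python sketch. It is not
TikaInputStream's implementation; the class name, the threshold, and
the error message are assumptions made up for this example.

```python
import io


class EOFGuardReader:
    """Wrap a binary stream and fail if a caller keeps reading past EOF."""

    def __init__(self, stream, max_consecutive_eofs=1000):
        # The default threshold is an arbitrary value chosen for illustration.
        self._stream = stream
        self._max = max_consecutive_eofs
        self._eofs = 0

    def read(self, size=-1):
        if size == 0:
            # A zero-length read says nothing about EOF; do not count it.
            return b""
        data = self._stream.read(size)
        if data:
            self._eofs = 0  # real data resets the counter
            return data
        self._eofs += 1
        if self._eofs > self._max:
            raise OSError("read too many EOFs; the caller may be stuck in a loop")
        return data
```

A parser that spins on an exhausted stream then fails with a normal
exception for that one file instead of hanging until a timeout kills
the whole process.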

Let me know if you have questions.  I haven't looked in great detail yet...

There's every chance that I need to make some more changes on the Tika side. :D

Cheers and happy 3.x!

Best,

      Tim

On Wed, Apr 7, 2021 at 9:23 AM Tim Allison <ta...@apache.org> wrote:
>
> LOL...  K.  I'll build locally with the PDFBOX-5153 fix and kick it
> off today or tomorrow.


Re: 3.0.0-RC1 regression tests?

Posted by Tim Allison <ta...@apache.org>.
LOL...  K.  I'll build locally with the PDFBOX-5153 fix and kick it
off today or tomorrow.

On Wed, Apr 7, 2021 at 1:40 AM Tilman Hausherr <TH...@t-online.de> wrote:
>
> Yes, it would be useful, and no, I haven't done it. I'm optimistic about
> the results despite PDFBOX-5153.
>
> Tilman


Re: 3.0.0-RC1 regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
Yes, it would be useful, and no, I haven't done it. I'm optimistic about
the results despite PDFBOX-5153.

Tilman

On 06.04.2021 at 17:22, Tim Allison wrote:
> Hi All,
>
>    Would it be useful for me to run regression tests comparing 2.x with
> 3.0.0-RC1 now or should I wait?  Or, has someone already done this?
>
>    See https://issues.apache.org/jira/browse/TIKA-3347 for integration
> with Tika.  Many thanks!
>
>        Cheers,
>
>             Tim


Re: 3.0.0-RC1 regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

That would be nice; I guess the last comparison was quite some time ago.

Cheers
Andreas


On 06.04.21 at 17:22, Tim Allison wrote:
> Hi All,
> 
>    Would it be useful for me to run regression tests comparing 2.x with
> 3.0.0-RC1 now or should I wait?  Or, has someone already done this?
> 
>    See https://issues.apache.org/jira/browse/TIKA-3347 for integration
> with Tika.  Many thanks!
> 
>        Cheers,
> 
>             Tim