Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2022/05/06 12:30:30 UTC

text extraction regression tests for 3.x?

All,
  Let me know when it makes sense to run the text extraction regression
tests for 3.x.  I regret I haven't been following our mailing list as
closely as I should be.

           Cheers,

                       Tim

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
Good to find them now!  Let me know when I should rerun and thank you!

Best,

     Tim

On Sun, May 29, 2022 at 12:37 PM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Thanks Tim,
>
> looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447.
>
> Maybe there are more to come ....
>
> Andreas
>
>
> Am 26.05.22 um 15:04 schrieb Tim Allison:
> > Apologies for my delay.  I ran trunk/3.x on May 12 against 2.0.26.  The
> > reports are here:
> >
> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
> >
> > Happy to rerun with a more recent version of trunk.
> >
> > Cheers,
> >
> >        Tim
> >
> > On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> >
> >> Am 06.05.22 um 14:30 schrieb Tim Allison:
> >>> All,
> >>>     Let me know when makes sense to run the text extraction regression
> >> Yes, it'd be useful to have some updated results.
> >>
> >> How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
> >> 3.0.0-alpha3?
> >>
> >>
> >>> tests for 3.x.  I regret I haven't been following our mailing list as
> >>> closely as I should be.
> >> No need to worry, everything is fine.
> >>
> >> Andreas
> >>
> >>>
> >>>              Cheers,
> >>>
> >>>                          Tim
> >>>

Re: text extraction regression tests for 3.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 15.06.22 um 12:19 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
@Tim thanks again

Looks like there aren't any new exceptions in 3.0.0 at all, ergo we are good to 
target a new release  :-)

Andreas





Re: text extraction regression tests for 3.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 15.06.22 um 13:07 schrieb Tim Allison:
> In "parse_time_millis_details.xlsx", there are some that took much longer
> in 3.x during the multithreaded run but do not show much of a difference
> singlethreaded...likely accidents of resources available at parse time.
> 
> Overall, the sum of processing times across all files is very similar.
> 
> However, I did find two files that really do take up far more time single
> threaded in 3.x vs. 2.x.  Again, I'm not sure these need to be dealt with
> immediately, and the time required may be a fault of Tika, not PDFBox.
I did some rendering tests and I can't see any significant difference, but I 
didn't do a scientific test with real figures ;-)

> commoncrawl3_refetched/SO/SONYLMWCHDDEOC3D5OHEXDTOJ7NGVODV
The file looks like a PDF containing Arabic text, but most of the text isn't
text at all. The PDF uses line graphics for the content. So, the question is:
what does Tika do in such cases, and why does 3.x seem to be slower than 2.x?

> commoncrawl3_refetched/OL/OLZ5TAS53B4BDC673OFMWZE5DDZ7ZGIN
This file is similar to the other one. It contains a lot of graphics and not 
that much text.

Maybe something with the rendering code and/or default settings is different and 
leads to slower results in 3.x?

Andreas




Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
In "parse_time_millis_details.xlsx", there are some that took much longer
in 3.x during the multithreaded run but do not show much of a difference
singlethreaded...likely accidents of resources available at parse time.

Overall, the sum of processing times across all files is very similar.

However, I did find two files that really do take up far more time single
threaded in 3.x vs. 2.x.  Again, I'm not sure these need to be dealt with
immediately, and the time required may be a fault of Tika, not PDFBox.

commoncrawl3_refetched/SO/SONYLMWCHDDEOC3D5OHEXDTOJ7NGVODV
commoncrawl3_refetched/OL/OLZ5TAS53B4BDC673OFMWZE5DDZ7ZGIN


On Wed, Jun 15, 2022 at 6:49 AM Tim Allison <ta...@apache.org> wrote:

> I had a chance to look at new_catastrophic_exceptions_in_b, and the three
> files in there take roughly the same amount of time and resources.  I think
> they failed on trunk only because of the whims of multithreading and
> available resources at the time.
>
> This file is admittedly quite large, but it was able to take up an
> unhealthy amount of resources (both RAM and CPU):
> bug_trackers/evince/evince-LINK-1250-0.pdf in both 2.x and 3.x (source:
> https://gitlab.gnome.org/GNOME/evince/-/issues/1250).  I don't think
> there's anything to be done for that one immediately.

Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
I had a chance to look at new_catastrophic_exceptions_in_b, and the three
files in there take roughly the same amount of time and resources.  I think
they failed on trunk only because of the whims of multithreading and
available resources at the time.

This file is admittedly quite large, but it was able to take up an
unhealthy amount of resources (both RAM and CPU):
bug_trackers/evince/evince-LINK-1250-0.pdf in both 2.x and 3.x (source:
https://gitlab.gnome.org/GNOME/evince/-/issues/1250).  I don't think
there's anything to be done for that one immediately.


On Wed, Jun 15, 2022 at 6:19 AM Tim Allison <ta...@apache.org> wrote:

> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz

Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
I wouldn't. :D

On Thu, Jun 16, 2022 at 12:16 PM Tilman Hausherr <TH...@t-online.de>
wrote:

> Am 15.06.2022 um 12:19 schrieb Tim Allison:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz
>
> govdocs1/372/372582.pdf
> commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5
> commoncrawl3/VN/VNCWMY6Y4C3XYWA65CQPPSNZSY6OQEEA
>
> have lost text. But the first one is a mess even with 2.0.26, and
> the two others are truncated. I wonder if we really need to bother.
>
> Tilman
>

Re: text extraction regression tests for 3.x?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 15.06.2022 um 12:19 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz

govdocs1/372/372582.pdf
commoncrawl3/KH/KHDACXIPFMWP632LZ3S4TRRSZPDGHGM5
commoncrawl3/VN/VNCWMY6Y4C3XYWA65CQPPSNZSY6OQEEA

have lost text. But the first one is a mess even with 2.0.26, and
the two others are truncated. I wonder if we really need to bother.

Tilman

Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-3-20220614.tgz

On Mon, Jun 13, 2022 at 4:54 PM Tim Allison <ta...@apache.org> wrote:

> Just seeing this now.  Y.  I'll kick off the tests tomorrow morning (ET).

Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
Just seeing this now.  Y.  I'll kick off the tests tomorrow morning (ET).

On Sat, Jun 11, 2022 at 8:09 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> I've fixed PDFBOX-5452 and found/fixed another one, see PDFBOX-5456
>
> @Tim is there any chance to rerun the regression tests?
>
> Thanks in advance
> Andreas

Re: text extraction regression tests for 3.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
I've fixed PDFBOX-5452 and found/fixed another one, see PDFBOX-5456

@Tim is there any chance to rerun the regression tests?

Thanks in advance
Andreas

Am 07.06.22 um 08:06 schrieb Andreas Lehmkuehler:
> I've found another regression, see PDFBOX-5452
> 
> Andreas




Re: text extraction regression tests for 3.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
I've found another regression, see PDFBOX-5452

Andreas

Am 29.05.22 um 18:37 schrieb Andreas Lehmkuehler:
> Thanks Tim,
> 
> looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447.
> 
> Maybe there are more to come ....
> 
> Andreas


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: text extraction regression tests for 3.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Thanks Tim,

looks like there are some regressions, see PDFBOX-5444 and PDFBOX-5447.

Maybe there are more to come ....

Andreas


On 26.05.22 at 15:04, Tim Allison wrote:
> Apologies for my delay.  I ran trunk/3.x on May 12 against 2.0.26.  The
> reports are here:
> https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz
> 
> Happy to rerun with a more recent version of trunk.
> 
> Cheers,
> 
>        Tim
> 
> On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler <an...@lehmi.de> wrote:
> 
>> On 06.05.22 at 14:30, Tim Allison wrote:
>>> All,
>>>     Let me know when it makes sense to run the text extraction regression
>> Yes, it'd be useful to have some updated results.
>>
>> How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
>> 3.0.0-alpha3?
>>
>>
>>> tests for 3.x.  I regret I haven't been following our mailing list as
>>> closely as I should be.
>> No need to worry, everything is fine.
>>
>> Andreas
>>
>>>
>>>              Cheers,
>>>
>>>                          Tim
>>>


Re: text extraction regression tests for 3.x?

Posted by Tim Allison <ta...@apache.org>.
Apologies for my delay.  I ran trunk/3.x on May 12 against 2.0.26.  The
reports are here:
https://corpora.tika.apache.org/base/reports/reports_pdfbox_3x_20220512.tgz

Happy to rerun with a more recent version of trunk.

Cheers,

      Tim

On Sun, May 8, 2022 at 1:21 PM Andreas Lehmkuehler <an...@lehmi.de> wrote:

> On 06.05.22 at 14:30, Tim Allison wrote:
> > All,
> >    Let me know when it makes sense to run the text extraction regression
> Yes, it'd be useful to have some updated results.
>
> How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs.
> 3.0.0-alpha3?
>
>
> > tests for 3.x.  I regret I haven't been following our mailing list as
> > closely as I should be.
> No need to worry, everything is fine.
>
> Andreas
>
> >
> >             Cheers,
> >
> >                         Tim
> >

Re: text extraction regression tests for 3.x?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 06.05.22 at 14:30, Tim Allison wrote:
> All,
>    Let me know when it makes sense to run the text extraction regression
Yes, it'd be useful to have some updated results.

How about comparing 2.0.26 vs 3.0.0-alpha3 and maybe 3.0.0-alpha2 vs. 3.0.0-alpha3?


> tests for 3.x.  I regret I haven't been following our mailing list as
> closely as I should be.
No need to worry, everything is fine.

Andreas

> 
>             Cheers,
> 
>                         Tim
> 