Posted to dev@pdfbox.apache.org by Andreas Lehmkuehler <an...@lehmi.de> on 2019/04/05 04:31:55 UTC

Release 2.0.15 ?

Hi,

It looks like it's time for the next release. How about cutting 2.0.15 next Monday?

WDYT?

Andreas

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: Release 2.0.15 ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 06.04.19 at 17:19, Tilman Hausherr wrote:
> I looked at about 10 files... all are rotated. I suspect this is a result of
> PDFBOX-4480: previously, some rotated words came out as one. But this doesn't
> matter; the overall extraction of rotated pages would still look bad.
PDFBOX-4480 is one reason, but it isn't the only one. Even without those
changes, the unsorted results differ from those of 2.0.14.

> 
> For example, the file you mention extracted this in 2.0.14:
> 
> ...
> R
> E
> R
> M
> H
> IV
> -1
> infection
> hum
> an(B
> 8)
> [G
> oulder97c]
> ...
> 
> So it had "infection" but the rest was still worthless. The same file extracts 
> nicely with the "rotationMagic" option of ExtractText.
I agree with Tilman: since the unsorted results are worthless either way, one
can't say that one is better or worse than the other. Only the sorted results
are useful, and those are equal. That said, IMHO this is not a regression.



Re: Release 2.0.15 ?

Posted by Tim Allison <ta...@apache.org>.
> But this doesn't matter, the overall extraction of rotated pages would
> still look bad.

>>...it appears that there were word segmentation problems in both A and B

Y. I agree.  Thank you for looking more carefully than I did!



Re: Release 2.0.15 ?

Posted by Tilman Hausherr <TH...@t-online.de>.
I looked at about 10 files... all are rotated. I suspect this is a
result of PDFBOX-4480: previously, some rotated words came out as one.
But this doesn't matter; the overall extraction of rotated pages would
still look bad.

For example, the file you mention extracted this in 2.0.14:

...
R
E
R
M
H
IV
-1
infection
hum
an(B
8)
[G
oulder97c]
...

So it had "infection" but the rest was still worthless. The same file 
extracts nicely with the "rotationMagic" option of ExtractText.

Tilman



Re: Release 2.0.15 ?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 06.04.19 at 15:50, Tim Allison wrote:
> http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz
> 
> This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
> there were no content differences btwn 2.0.13 and 2.0.14.  I did not
> apply angle detection.
Thanks again for running the tests.

> No new exceptions; 2 fixed exceptions.  We're getting higher page
> counts in a few documents, because we overrode processPages() to
> process.  Some changes in content, but overall, better, I think, based
> on contents/common_token_comparisons_by_mime.xlsx.
> 
> To see where content appears to degrade, open
> contents/content_diffs_(no|with)_exceptions, and sort column M
> ('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order.  Also, look at
> columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
> (TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
> frequent tokens that are unique to A or unique to B; from this, it
> looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
> but, generally (hand waving), it appears that there were word
> segmentation problems in both A and B as I look at the results.
I had a first look and there are differences, but I'm not sure if it is a 
regression.

The sorted text extraction results from 2.0.13/14 and 2.0.15-SNAPSHOT are
equal. The unsorted results from 2.0.13 and 2.0.14 are equal to each other,
but those from 2.0.15-SNAPSHOT are different.

Still investigating ...

Andreas



Re: Release 2.0.15 ?

Posted by Tim Allison <ta...@apache.org>.
http://162.242.228.174/reports/reports_pdfbox_2.0.15-SNAPSHOT.tgz

This compares 2.0.15-SNAPSHOT with 2.0.13 (I think)...IIRC, though,
there were no content differences btwn 2.0.13 and 2.0.14.  I did not
apply angle detection.

No new exceptions; 2 fixed exceptions.  We're getting higher page
counts in a few documents, because we overrode processPages() to
process.  Some changes in content, but overall, better, I think, based
on contents/common_token_comparisons_by_mime.xlsx.

To see where content appears to degrade, open
contents/content_diffs_(no|with)_exceptions, and sort column M
('NUM_COMMON_TOKENS_DIFF_IN_B') in ascending order.  Also, look at
columns R (TOP_10_UNIQUE_TOKEN_DIFFS_A) and S
(TOP_10_UNIQUE_TOKEN_DIFFS_B)...these columns show the top 10 most
frequent tokens that are unique to A or unique to B; from this, it
looks like there is a regression in, e.g. govdocs1/038/038519.pdf,
but, generally (hand waving), it appears that there were word
segmentation problems in both A and B as I look at the results.
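
As a rough illustration of what the "unique token" columns capture, here is a
minimal Java sketch. It is not the code that produced the report; the class
name TokenDiffSketch, both method names, and the two sample strings are
hypothetical, and tokens are assumed to be plain whitespace-separated words.
It counts tokens in two extraction results A and B and lists the most frequent
tokens that occur in only one of them.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; names and sample texts are assumptions for illustration.
class TokenDiffSketch {

    // Count how often each whitespace-separated token occurs in the text.
    static Map<String, Integer> countTokens(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Return up to n tokens that occur in 'mine' but never in 'other',
    // most frequent first; ties are broken alphabetically.
    static List<String> topUniqueTokens(Map<String, Integer> mine,
                                        Map<String, Integer> other, int n) {
        List<Map.Entry<String, Integer>> unique = new ArrayList<>();
        for (Map.Entry<String, Integer> e : mine.entrySet()) {
            if (!other.containsKey(e.getKey())) {
                unique.add(e);
            }
        }
        unique.sort((x, y) -> {
            int byCount = Integer.compare(y.getValue(), x.getValue());
            return byCount != 0 ? byCount : x.getKey().compareTo(y.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, unique.size()); i++) {
            result.add(unique.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up extraction results: B splits words that A keeps whole.
        Map<String, Integer> a = countTokens("HIV infection infection human");
        Map<String, Integer> b = countTokens("HIV infect ion infect ion hum an");
        System.out.println("Unique to A: " + topUniqueTokens(a, b, 10));
        System.out.println("Unique to B: " + topUniqueTokens(b, a, 10));
    }
}
```

Word fragments showing up as unique tokens on one side are the kind of hint
that points at a word segmentation difference rather than lost content.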

Cheers,

             Tim

On Fri, Apr 5, 2019 at 10:53 AM Tim Allison <ta...@apache.org> wrote:
>
> +1 I should have regression results by tomorrow
>
> On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun <sa...@fileaffairs.de> wrote:
>>
>> +1


Re: Release 2.0.15 ?

Posted by Tim Allison <ta...@apache.org>.
+1 I should have regression results by tomorrow

On Fri, Apr 5, 2019 at 2:15 AM Maruan Sahyoun <sa...@fileaffairs.de> wrote:

> +1

Re: Release 2.0.15 ?

Posted by Maruan Sahyoun <sa...@fileaffairs.de>.
+1



Re: Release 2.0.15 ?

Posted by Tilman Hausherr <TH...@t-online.de>.
+1

Tilman
