Posted to dev@pdfbox.apache.org by Tilman Hausherr <TH...@t-online.de> on 2021/03/07 05:04:49 UTC

2.0.22 vs 2.0.23

Report is here:

http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org


Re: 2.0.22 vs 2.0.23

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Hi,

@Tilman
Thanks for running all those tests. The results look good to me, although
I have to admit that I don't understand every bit of those sheets, especially
those about the content ;-)

However, I'm planning to cut the 2.0.23 release next Monday.

Andreas

On 08.03.21 at 11:17, Tilman Hausherr wrote:
> new report:
> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_3.tar.xz
> 
> Tilman
> 
> On 08.03.2021 at 10:35, Tilman Hausherr wrote:
>> I think we're good (despite the differences, most of which are because of the 
>> soft hyphen), but I'm now experimenting with a modified version of tika-eval 
>> to see what happens.
>>
>> Tilman
>>
>> On 07.03.2021 at 19:47, Tilman Hausherr wrote:
>>> new report at
>>>
>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz
>>>
>>> Tilman
>>>
>>> On 07.03.2021 at 11:43, Tilman Hausherr wrote:
>>>> On 07.03.2021 at 06:04, Tilman Hausherr wrote:
>>>>> Report is here:
>>>>>
>>>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 
>>>>
>>>> There's not much changed. No new exceptions. Re content, the changes that 
>>>> seem important are all related to "soft hyphen".
>>>>
>>>> https://issues.apache.org/jira/browse/PDFBOX-5115
>>>>
>>>> I am currently fixing this, and then I'll run the tests again. The text 
>>>> extraction differences will likely stay. It's possible that a change in 
>>>> tika-eval is needed too.
>>>>
>>>> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
On 12.03.2021 at 14:05, Tim Allison wrote:
> Many, many thanks to Tilman for running the regression tests!
>
> The 2 new exceptions are caused by PDFBOX-5127.  I'm baffled that we
> haven't seen these before, but they do require some rare
> circumstances.


Latest run:

http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_6.tar.xz

The two files are gone. (Yes, this does not formally prove that we fixed
it, but let's just believe it does.)

Tilman




Re: 2.0.22 vs 2.0.23

Posted by Tim Allison <ta...@apache.org>.
Many, many thanks to Tilman for running the regression tests!

The 2 new exceptions are caused by PDFBOX-5127.  I'm baffled that we
haven't seen these before, but they do require some rare
circumstances.

The 1 new Tika exception is a zero-byte file exception.  This is my
fault because I changed the files between Tilman's runs.

As for XMPBox, Tilman is right that when I tried to use it many years
ago, it did not have the flexibility needed for PDFs in the wild.
See: https://lucene.472066.n3.nabble.com/DISCUSS-options-for-XMP-parsing-td4262520.html

2016 me: "I found that it fails on roughly 40% of XMPs I pulled out of
PDFs from govdocs1/commoncrawl"

Cheers,

             Tim

On Thu, Mar 11, 2021 at 1:34 PM Tilman Hausherr <TH...@t-online.de> wrote:
>
> On 11.03.2021 at 09:00, sahyoun@fileaffairs.de wrote:
> >> The three new exceptions weren't in earlier reports.
> >>
> >> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there
> >> is
> >> a non standard schema.
> > would it make sense to add that support? If yes could we get samples of
> > various schema to support that development? Could look into that if we
> > think that's worth the effort
>
>
> Here's an example:
>
> https://issues.apache.org/jira/browse/PDFBOX-3440
>
>
> Tilman
>
>
>
> >
> > Maruan
> >
> >
> >> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
On 11.03.2021 at 09:00, sahyoun@fileaffairs.de wrote:
>> The three new exceptions weren't in earlier reports.
>>
>> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there
>> is
>> a non standard schema.
> would it make sense to add that support? If yes could we get samples of
> various schema to support that development? Could look into that if we
> think that's worth the effort


Here's an example:

https://issues.apache.org/jira/browse/PDFBOX-3440


Tilman



>
> Maruan
>
>
>> Tilman


Re: 2.0.22 vs 2.0.23

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.
On Friday, 12.03.2021 at 08:15 -0500, Tim Allison wrote:
> > would it make sense to add that support? If yes could we get samples
> > of
> > various schema to support that development? Could look into that if
> > we
> > think that's worth the effort
> 
> I think I can find some XMPs if they'd be of any use! :D

That would be great - maybe together with expected extraction results -
so I can start with proper unit tests. If you could add them to

https://issues.apache.org/jira/browse/PDFBOX-5128

that would be great.

BR


-- 
Maruan Sahyoun





Re: 2.0.22 vs 2.0.23

Posted by Tim Allison <ta...@apache.org>.
> would it make sense to add that support? If yes could we get samples of
> various schema to support that development? Could look into that if we
> think that's worth the effort

I think I can find some XMPs if they'd be of any use! :D



Re: 2.0.22 vs 2.0.23

Posted by "sahyoun@fileaffairs.de" <sa...@fileaffairs.de>.
On Thursday, 11.03.2021 at 07:56 +0100, Tilman Hausherr wrote:
> On 11.03.2021 at 07:46, Andreas Lehmkuehler wrote:
> > On 11.03.21 at 07:24, Tilman Hausherr wrote:
> > > new report
> > > http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz
> > > 
> > > The content differences part is now the smallest ever, likely due
> > > to 
> > > my change in tika-eval (TIKA-3314) and restoring a PDFBox code 
> > > segment I accidentally deleted (PDFBOX-5115).
> > Cool!!
> > 
> > > There are three new exceptions. Two are in jempbox and one is in
> > > tika 
> > > itself so I suspect PDFBox isn't to blame. I'll look at it too if
> > > I 
> > > have the time.
> > As far as I remember the jempbox issue isn't new, Tim mentioned it 
> > some time ago. Just out of curiosity does it make sense to use an
> > old 
> > lib to extract metadata? Is there anything missing in xmpbox but 
> > available in jempbox?
> > 
> The three new exceptions weren't in earlier reports.
> 
> IIRC the reason Tika uses Jempbox is because Xmpbox fails when there
> is 
> a non standard schema.

Would it make sense to add that support? If yes, could we get samples of
various schema to support that development? Could look into that if we
think that's worth the effort

Maruan


> 
> Tilman

-- 
Maruan Sahyoun

FileAffairs GmbH
Josef-Schappe-Straße 21
40882 Ratingen

Tel: +49 (2102) 89497 88
Fax: +49 (2102) 89497 91
sahyoun@fileaffairs.de
www.fileaffairs.de

Managing Director: Maruan Sahyoun
Commercial Register: AG Düsseldorf, HRB 53837
VAT ID: DE248275827




Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
On 11.03.2021 at 07:46, Andreas Lehmkuehler wrote:
> On 11.03.21 at 07:24, Tilman Hausherr wrote:
>> new report
>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz
>>
>> The content differences part is now the smallest ever, likely due to 
>> my change in tika-eval (TIKA-3314) and restoring a PDFBox code 
>> segment I accidentally deleted (PDFBOX-5115).
> Cool!!
>
>> There are three new exceptions. Two are in jempbox and one is in tika 
>> itself so I suspect PDFBox isn't to blame. I'll look at it too if I 
>> have the time.
> As far as I remember the jempbox issue isn't new, Tim mentioned it 
> some time ago. Just out of curiosity does it make sense to use an old 
> lib to extract metadata? Is there anything missing in xmpbox but 
> available in jempbox?
>
The three new exceptions weren't in earlier reports.

IIRC the reason Tika uses Jempbox is because Xmpbox fails when there is 
a non standard schema.

Tilman




Re: 2.0.22 vs 2.0.23

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
On 11.03.21 at 07:24, Tilman Hausherr wrote:
> new report
> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz
> 
> The content differences part is now the smallest ever, likely due to my change 
> in tika-eval (TIKA-3314) and restoring a PDFBox code segment I accidentally 
> deleted (PDFBOX-5115).
Cool!!

> There are three new exceptions. Two are in jempbox and one is in tika itself so 
> I suspect PDFBox isn't to blame. I'll look at it too if I have the time.
As far as I remember the jempbox issue isn't new, Tim mentioned it some time 
ago. Just out of curiosity does it make sense to use an old lib to extract 
metadata? Is there anything missing in xmpbox but available in jempbox?


Andreas

> 
> Tilman
> 
> 
> On 08.03.2021 at 11:17, Tilman Hausherr wrote:
>> new report:
>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_3.tar.xz
>>
>> Tilman
>>
>> On 08.03.2021 at 10:35, Tilman Hausherr wrote:
>>> I think we're good (despite the differences, most of which are because of the 
>>> soft hyphen), but I'm now experimenting with a modified version of tika-eval 
>>> to see what happens.
>>>
>>> Tilman
>>>
>>> On 07.03.2021 at 19:47, Tilman Hausherr wrote:
>>>> new report at
>>>>
>>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz
>>>>
>>>> Tilman
>>>>
>>>> On 07.03.2021 at 11:43, Tilman Hausherr wrote:
>>>>> On 07.03.2021 at 06:04, Tilman Hausherr wrote:
>>>>>> Report is here:
>>>>>>
>>>>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 
>>>>>
>>>>>
>>>>> There's not much changed. No new exceptions. Re content, the changes that 
>>>>> seem important are all related to "soft hyphen".
>>>>>
>>>>> https://issues.apache.org/jira/browse/PDFBOX-5115
>>>>>
>>>>> I am currently fixing this, and then I'll run the tests again. The text 
>>>>> extraction differences will likely stay. It's possible that a change in 
>>>>> tika-eval is needed too.
>>>>>
>>>>> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
new report
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_5.tar.xz

The content differences part is now the smallest ever, likely due to my 
change in tika-eval (TIKA-3314) and restoring a PDFBox code segment I 
accidentally deleted (PDFBOX-5115).

There are three new exceptions. Two are in jempbox and one is in tika 
itself so I suspect PDFBox isn't to blame. I'll look at it too if I have 
the time.

Tilman


On 08.03.2021 at 11:17, Tilman Hausherr wrote:
> new report:
> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_3.tar.xz
>
> Tilman
>
> On 08.03.2021 at 10:35, Tilman Hausherr wrote:
>> I think we're good (despite the differences, most of which are 
>> because of the soft hyphen), but I'm now experimenting with a 
>> modified version of tika-eval to see what happens.
>>
>> Tilman
>>
>> On 07.03.2021 at 19:47, Tilman Hausherr wrote:
>>> new report at
>>>
>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz 
>>>
>>>
>>> Tilman
>>>
>>> On 07.03.2021 at 11:43, Tilman Hausherr wrote:
>>>> On 07.03.2021 at 06:04, Tilman Hausherr wrote:
>>>>> Report is here:
>>>>>
>>>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 
>>>>
>>>>
>>>> There's not much changed. No new exceptions. Re content, the 
>>>> changes that seem important are all related to "soft hyphen".
>>>>
>>>> https://issues.apache.org/jira/browse/PDFBOX-5115
>>>>
>>>> I am currently fixing this, and then I'll run the tests again. The 
>>>> text extraction differences will likely stay. It's possible that a 
>>>> change in tika-eval is needed too.
>>>>
>>>> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
new report:
http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_3.tar.xz

Tilman

On 08.03.2021 at 10:35, Tilman Hausherr wrote:
> I think we're good (despite the differences, most of which are because 
> of the soft hyphen), but I'm now experimenting with a modified version 
> of tika-eval to see what happens.
>
> Tilman
>
> On 07.03.2021 at 19:47, Tilman Hausherr wrote:
>> new report at
>>
>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz
>>
>> Tilman
>>
>> On 07.03.2021 at 11:43, Tilman Hausherr wrote:
>>> On 07.03.2021 at 06:04, Tilman Hausherr wrote:
>>>> Report is here:
>>>>
>>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 
>>>
>>> There's not much changed. No new exceptions. Re content, the changes 
>>> that seem important are all related to "soft hyphen".
>>>
>>> https://issues.apache.org/jira/browse/PDFBOX-5115
>>>
>>> I am currently fixing this, and then I'll run the tests again. The 
>>> text extraction differences will likely stay. It's possible that a 
>>> change in tika-eval is needed too.
>>>
>>> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
I think we're good (despite the differences, most of which are because 
of the soft hyphen), but I'm now experimenting with a modified version 
of tika-eval to see what happens.

Tilman

On 07.03.2021 at 19:47, Tilman Hausherr wrote:
> new report at
>
> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz
>
> Tilman
>
> On 07.03.2021 at 11:43, Tilman Hausherr wrote:
>> On 07.03.2021 at 06:04, Tilman Hausherr wrote:
>>> Report is here:
>>>
>>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 
>>
>> There's not much changed. No new exceptions. Re content, the changes 
>> that seem important are all related to "soft hyphen".
>>
>> https://issues.apache.org/jira/browse/PDFBOX-5115
>>
>> I am currently fixing this, and then I'll run the tests again. The 
>> text extraction differences will likely stay. It's possible that a 
>> change in tika-eval is needed too.
>>
>> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
new report at

http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23_2.tar.xz

Tilman

On 07.03.2021 at 11:43, Tilman Hausherr wrote:
> On 07.03.2021 at 06:04, Tilman Hausherr wrote:
>> Report is here:
>>
>> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 
>
> There's not much changed. No new exceptions. Re content, the changes 
> that seem important are all related to "soft hyphen".
>
> https://issues.apache.org/jira/browse/PDFBOX-5115
>
> I am currently fixing this, and then I'll run the tests again. The 
> text extraction differences will likely stay. It's possible that a 
> change in tika-eval is needed too.
>
> Tilman


Re: 2.0.22 vs 2.0.23

Posted by Tilman Hausherr <TH...@t-online.de>.
On 07.03.2021 at 06:04, Tilman Hausherr wrote:
> Report is here:
>
> http://home.snafu.de/tilman/tmp/reports_pdfbox_2.0.22_vs_2.0.23.tar.xz 

Not much has changed. No new exceptions. Regarding content, the changes
that seem important are all related to the "soft hyphen".

https://issues.apache.org/jira/browse/PDFBOX-5115

I am currently fixing this, and then I'll run the tests again. The text 
extraction differences will likely stay. It's possible that a change in 
tika-eval is needed too.

Tilman
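
For readers unfamiliar with the soft hyphen issue: the soft hyphen (U+00AD)
is an invisible hyphenation hint, so two extractions that differ only in
whether it is emitted compare as different text. The following is a minimal
Java sketch of how a comparison could ignore it. It is not the actual PDFBox
or tika-eval code; the class and method names (SoftHyphenNormalizer,
stripSoftHyphens, tokens, sameTokens) are hypothetical and chosen only for
illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper, not part of PDFBox or tika-eval.
final class SoftHyphenNormalizer {

    // U+00AD SOFT HYPHEN: usually invisible, marks a possible line-break point.
    static final char SOFT_HYPHEN = '\u00AD';

    // Removes every soft hyphen so "extrac\u00ADtion" becomes "extraction".
    static String stripSoftHyphens(String text) {
        return text.replace(String.valueOf(SOFT_HYPHEN), "");
    }

    // Splits on whitespace after removing soft hyphens; empty tokens are dropped.
    static List<String> tokens(String text) {
        return Arrays.stream(stripSoftHyphens(text).trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // True if both texts yield the same token sequence once soft hyphens
    // and whitespace differences are ignored.
    static boolean sameTokens(String a, String b) {
        return tokens(a).equals(tokens(b));
    }
}
```

With such a normalization, output from two versions that differs only by
soft hyphens would no longer show up as a content difference, while real
text changes would still be reported.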

