Posted to corpora-dev@tika.apache.org by Andreas Lehmkuehler <an...@lehmi.de> on 2020/08/02 13:09:14 UTC

Re: PDFBox regression tests?

On 28.07.20 at 23:51, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> 
> Looks like extraction improved slightly.  I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
> 
> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
> 
> I'll look a bit tomorrow, but this looks good to me.
Do you have some time to rerun those tests using the latest SNAPSHOT?

Andreas

> 
> Again, many thanks to Maruan!  The processing speeds were, um, much, much
> faster.
> 
> Best,
> 
>         Tim
> 
> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Yes, please
>>
>> Thanks in advance!
>>
>> On 28.07.20 at 12:45, Tim Allison wrote:
>>> Y. I can run these today
>>>
>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
>>>> something ready to start?
>>>>
>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>> PDFBox release soon.
>>>>
>>>> Cheers
>>>> Andreas
>>>>
>>>
>>
>>
>