Posted to corpora-dev@tika.apache.org by Andreas Lehmkuehler <an...@lehmi.de> on 2020/07/28 06:58:11 UTC

PDFBox regression tests?

Hi,

is there any chance to run the PDFBox regression tests (2.0.20 vs. SNAPSHOT) on
our new box? Does anyone have the cycles to prepare something ready to start?

If not, is there anything I can do to help? I'm planning to cut a new PDFBox 
release soon.

Cheers
Andreas

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 28.07.20 at 23:51, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> 
> Looks like extraction improved slightly.  I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
> 
> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
> 
> I'll look a bit tomorrow, but this looks good to me.
Do you have some time to rerun those tests using the latest SNAPSHOT?

Andreas

> 
> Again, many thanks to Maruan!  The processing speeds were, um, much, much
> faster.
> 
> Best,
> 
>         Tim
> 
> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Yes, please
>>
>> Thanks in advance!
>>
>> Am 28.07.20 um 12:45 schrieb Tim Allison:
>>> Y. I can run these today
>>>
>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>> SNAPSHOT) on
>>>> our new box? Does anyone had the cycles to prepare something ready to
>>>> start?
>>>>
>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>> PDFBox
>>>> release soon.
>>>>
>>>> Cheers
>>>> Andreas
>>>>
>>>
>>
>>
>

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 29.07.20 at 17:33, Andreas Lehmkuehler wrote:
> On 28.07.20 at 23:51, Tim Allison wrote:
>> Reports are here:
>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>
>> Looks like extraction improved slightly.  I found a bug at the Tika level
>> that is creating a few more exceptions (will fix soon), but this is not a
>> problem for PDFBox.
> There are some new PDFBox related exceptions. 3 of them are already in 2.0.21 
> (the AES and the CodespaceRange issue). I'm investigating the others.
Correction: those 3 issues are already present in 2.0.20, not 2.0.21.

The others are solved now. There was a problem with the ordering of offsets in
an object stream. According to the spec those shall be sorted, but in the
affected files they aren't; see PDFBOX-4927 for details.
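
To illustrate the kind of problem described here, the following is a minimal sketch, not the actual PDFBox fix. It assumes the object stream header has already been decoded into a string of "objectNumber offset" pairs; the class and method names (ObjectStreamOffsets, readSortedByOffset) are hypothetical. The idea is simply not to rely on the file listing its offsets in increasing order, and to sort them before the objects are read.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ObjectStreamOffsets {

    /**
     * Parses n "objectNumber offset" pairs from a decoded object stream
     * header and returns them ordered by offset. Each entry is a
     * two-element array: {objectNumber, offset}.
     */
    static List<long[]> readSortedByOffset(String header, int n) {
        String[] tokens = header.trim().split("\\s+");
        if (tokens.length < 2 * n) {
            throw new IllegalArgumentException("header has fewer than " + n + " pairs");
        }
        List<long[]> entries = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            long objectNumber = Long.parseLong(tokens[2 * i]);
            long offset = Long.parseLong(tokens[2 * i + 1]);
            entries.add(new long[] {objectNumber, offset});
        }
        // Do not trust the file to list the offsets in increasing order.
        entries.sort(Comparator.comparingLong((long[] e) -> e[1]));
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical header with unsorted offsets: 12 -> 40, 13 -> 0, 14 -> 17
        for (long[] e : readSortedByOffset("12 40 13 0 14 17", 3)) {
            System.out.println("object " + e[0] + " at offset " + e[1]);
        }
    }
}
```

With the offsets sorted, each object can be read from its own offset up to the next one, regardless of the order in which the header listed them.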

>> I was able to turn back on our unit test that counted characters and
>> non-unicode mapped characters.
>>
>> I'll look a bit tomorrow, but this looks good to me.
> @Tim thanks again for running those tests. I've stumbled upon one minor glitch 
> in your reports. There are two sheets about parse time. The overall report 
> parse_time_millis_by_mime_compared.xlsx states that the PDF parsing time has 
> decreased to 88%, but if I have a look at the details report 
> parse_time_millis_details.xlsx, it looks like the parsing time has increased.
> 
> Am I mistaken or is there a glitch in your report (swapped columns)?
> 
> 
>>
>> Again, many thanks to Maruan!  The processing speeds were, um, much, much
>> faster.
>>
>> Best,
>>
>>         Tim
>>
>> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
>> wrote:
>>
>>> Yes, please
>>>
>>> Thanks in advance!
>>>
>>> Am 28.07.20 um 12:45 schrieb Tim Allison:
>>>> Y. I can run these today
>>>>
>>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>>> SNAPSHOT) on
>>>>> our new box? Does anyone had the cycles to prepare something ready to
>>>>> start?
>>>>>
>>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>>> PDFBox
>>>>> release soon.
>>>>>
>>>>> Cheers
>>>>> Andreas
>>>>>
>>>>
>>>
>>>
>>
> 
> 
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 28.07.20 at 23:51, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> 
> Looks like extraction improved slightly.  I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
There are some new PDFBox related exceptions. 3 of them are already in 2.0.21 
(the AES and the CodespaceRange issue). I'm investigating the others.


> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
> 
> I'll look a bit tomorrow, but this looks good to me.
@Tim thanks again for running those tests. I've stumbled upon one minor glitch 
in your reports. There are two sheets about parse time. The overall report 
parse_time_millis_by_mime_compared.xlsx states that the PDF parsing time has
decreased to 88%, but if I have a look at the details report
parse_time_millis_details.xlsx, it looks like the parsing time has increased.

Am I mistaken or is there a glitch in your report (swapped columns)?


> 
> Again, many thanks to Maruan!  The processing speeds were, um, much, much
> faster.
> 
> Best,
> 
>         Tim
> 
> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Yes, please
>>
>> Thanks in advance!
>>
>> Am 28.07.20 um 12:45 schrieb Tim Allison:
>>> Y. I can run these today
>>>
>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>> SNAPSHOT) on
>>>> our new box? Does anyone had the cycles to prepare something ready to
>>>> start?
>>>>
>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>> PDFBox
>>>> release soon.
>>>>
>>>> Cheers
>>>> Andreas
>>>>
>>>
>>
>>
> 



Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 12.08.20 at 23:21, Tim Allison wrote:
> All,
>    Apologies for my delay...
> 
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
Thanks!

> I haven't had a chance to look at the reports yet. :(
I had a look at the new exceptions and both are already in 2.0.20. I can't see 
any showstopper here.

Andreas
> I tried to update the instructions for running the process on the vm.
> Please let me know if you have any questions, or if I need to make
> improvements.
> 
> Thank you.
> 
>     Best,
> 
>                Tim
> 
> On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
>>> Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
>>>> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>>>>> I've looked at all the files I had highlighted yesterday. All
>> differences
>>>>> except two are related to the metadata problem.
>>>>>
>>>>> The other two have a problem with spaces, i.e. glyphs not being near
>> each other.
>>>>>
>>>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>>>
>>>>> This doesn't have to be a bug, I've seen many files where the
>> extraction is
>>>>> better, so whatever change there is may have improved more things.
>>>> Thanks, for the analysis. IMHO we are good to cut a new release, aren't
>> we?
>>>
>>>
>>> Yeah we could.
>>>
>>> But if the bug gets solved it would be nice to have a new diff output to
>> see if
>>> anything else gets shown more clearly.
>> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there
>> anything
>> else we have to wait before we run the tests again, maybe some tika fix?
>>
>> Andreas
>>
>>> Tilman
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Tilman
>>>>>
>>>>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>>>>> Hi,
>>>>>>
>>>>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>>>>
>>>>>> There's something with the XMP metadata extraction. dc:title: is
>> empty (or
>>>>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>>>>>
>>>>>> I thought this could be related to some minor xmpbox changes but tika
>>>>>> doesn't use it. So I searched and found some changes in
>> PDMetadataExtractor.
>>>>>>
>>>>>> I'm not yet sure if that is the cause, although I played around with
>> that one.
>>>>>>
>>>>>> If it is, then it is related to
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>>>>> Looks like there may be some issues with Japanese...don't know if
>> this is
>>>>>>> related to your observation?
>>>>>>>
>>>>>>> It feels like when I sort by ascending order of
>>>>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
>> pairs
>>>>>>> in the "lost common tokens".
>>>>>>>
>>>>>>> Will look a bit more.
>>>>>>>
>>>>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <
>> THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>>>>> Reports are here:
>>>>>>>>>
>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>>>>>
>>>>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A"
>> has
>>>>>>>> meaningful content, that is suspicious and needs further
>> investigation.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>
> 



Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

On 12.08.2020 at 23:21, Tim Allison wrote:
> All,
>    Apologies for my delay...
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
>
> I haven't had a chance to look at the reports yet. :(
>
> I tried to update the instructions for running the process on the vm.
> Please let me know if you have any questions, or if I need to make
> improvements.


Thanks. It looks good that there are no showstoppers, i.e. thumbs up
from me.

But the Tika problem is still there, although less often. It is gone for the
file I mentioned (or the file wasn't in the test), but not for others, e.g.

commoncrawl3/6V/6VTB5IUKXBFA3JZPJBUPVSRY7L56K6LE

commoncrawl3/5I/5I6STZEO5W25GPETYGLCDLIB6OKQXCIG

Tilman



>
> Thank you.
>
>     Best,
>
>                Tim
>
> On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
>
>> Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
>>> Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
>>>> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>>>>> I've looked at all the files I had highlighted yesterday. All
>> differences
>>>>> except two are related to the metadata problem.
>>>>>
>>>>> The other two have a problem with spaces, i.e. glyphs not being near
>> each other.
>>>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>>>
>>>>> This doesn't have to be a bug, I've seen many files where the
>> extraction is
>>>>> better, so whatever change there is may have improved more things.
>>>> Thanks, for the analysis. IMHO we are good to cut a new release, aren't
>> we?
>>>
>>> Yeah we could.
>>>
>>> But if the bug gets solved it would be nice to have a new diff output to
>> see if
>>> anything else gets shown more clearly.
>> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there
>> anything
>> else we have to wait before we run the tests again, maybe some tika fix?
>>
>> Andreas
>>
>>> Tilman
>>>
>>>
>>>
>>>>
>>>>> Tilman
>>>>>
>>>>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>>>>> Hi,
>>>>>>
>>>>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>>>>
>>>>>> There's something with the XMP metadata extraction. dc:title: is
>> empty (or
>>>>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>>>>>
>>>>>> I thought this could be related to some minor xmpbox changes but tika
>>>>>> doesn't use it. So I searched and found some changes in
>> PDMetadataExtractor.
>>>>>> I'm not yet sure if that is the cause, although I played around with
>> that one.
>>>>>> If it is, then it is related to
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>>>>> Looks like there may be some issues with Japanese...don't know if
>> this is
>>>>>>> related to your observation?
>>>>>>>
>>>>>>> It feels like when I sort by ascending order of
>>>>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
>> pairs
>>>>>>> in the "lost common tokens".
>>>>>>>
>>>>>>> Will look a bit more.
>>>>>>>
>>>>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <
>> THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>>>>> Reports are here:
>>>>>>>>>
>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A"
>> has
>>>>>>>> meaningful content, that is suspicious and needs further
>> investigation.
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>
>>



Re: PDFBox regression tests?

Posted by Tim Allison <ta...@apache.org>.

All,
  Apologies for my delay...

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz

I haven't had a chance to look at the reports yet. :(

I tried to update the instructions for running the process on the vm.
Please let me know if you have any questions, or if I need to make
improvements.

Thank you.

   Best,

              Tim

On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
> > Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
> >> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
> >>> I've looked at all the files I had highlighted yesterday. All
> differences
> >>> except two are related to the metadata problem.
> >>>
> >>> The other two have a problem with spaces, i.e. glyphs not being near
> each other.
> >>>
> >>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> >>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
> >>>
> >>> This doesn't have to be a bug, I've seen many files where the
> extraction is
> >>> better, so whatever change there is may have improved more things.
> >> Thanks, for the analysis. IMHO we are good to cut a new release, aren't
> we?
> >
> >
> > Yeah we could.
> >
> > But if the bug gets solved it would be nice to have a new diff output to
> see if
> > anything else gets shown more clearly.
> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there
> anything
> else we have to wait before we run the tests again, maybe some tika fix?
>
> Andreas
>
> > Tilman
> >
> >
> >
> >>
> >>
> >>>
> >>> Tilman
> >>>
> >>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
> >>>> Hi,
> >>>>
> >>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
> >>>>
> >>>> There's something with the XMP metadata extraction. dc:title: is
> empty (or
> >>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
> >>>>
> >>>> I thought this could be related to some minor xmpbox changes but tika
> >>>> doesn't use it. So I searched and found some changes in
> PDMetadataExtractor.
> >>>>
> >>>> I'm not yet sure if that is the cause, although I played around with
> that one.
> >>>>
> >>>> If it is, then it is related to
> >>>>
> >>>> https://issues.apache.org/jira/browse/TIKA-3101
> >>>>
> >>>> Tilman
> >>>>
> >>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
> >>>>> Looks like there may be some issues with Japanese...don't know if
> this is
> >>>>> related to your observation?
> >>>>>
> >>>>> It feels like when I sort by ascending order of
> >>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
> pairs
> >>>>> in the "lost common tokens".
> >>>>>
> >>>>> Will look a bit more.
> >>>>>
> >>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <
> THausherr@t-online.de>
> >>>>> wrote:
> >>>>>
> >>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
> >>>>>>> Reports are here:
> >>>>>>>
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> >>>>>>
> >>>>>> Thank you. Besides the exceptions, there are a few cases in content
> >>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A"
> has
> >>>>>> meaningful content, that is suspicious and needs further
> investigation.
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>>
> >>>>
> >>>>
>

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 31.07.20 at 08:27, Tilman Hausherr wrote:
> On 31.07.2020 at 08:05, Andreas Lehmkuehler wrote:
>> On 30.07.20 at 20:22, Tilman Hausherr wrote:
>>> I've looked at all the files I had highlighted yesterday. All differences 
>>> except two are related to the metadata problem.
>>>
>>> The other two have a problem with spaces, i.e. glyphs not being near each other.
>>>
>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>
>>> This doesn't have to be a bug, I've seen many files where the extraction is 
>>> better, so whatever change there is may have improved more things.
>> Thanks, for the analysis. IMHO we are good to cut a new release, aren't we?
> 
> 
> Yeah we could.
> 
> But if the bug gets solved it would be nice to have a new diff output to see if 
> anything else gets shown more clearly.
I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
else we have to wait for before we run the tests again, maybe some Tika fix?

Andreas

> Tilman
> 
> 
> 
>>
>>
>>>
>>> Tilman
>>>
>>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>>> Hi,
>>>>
>>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>>
>>>> There's something with the XMP metadata extraction. dc:title: is empty (or 
>>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>>>
>>>> I thought this could be related to some minor xmpbox changes but tika 
>>>> doesn't use it. So I searched and found some changes in PDMetadataExtractor.
>>>>
>>>> I'm not yet sure if that is the cause, although I played around with that one.
>>>>
>>>> If it is, then it is related to
>>>>
>>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>>
>>>> Tilman
>>>>
>>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>>> Looks like there may be some issues with Japanese...don't know if this is
>>>>> related to your observation?
>>>>>
>>>>> It feels like when I sort by ascending order of
>>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>>>>> in the "lost common tokens".
>>>>>
>>>>> Will look a bit more.
>>>>>
>>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>>>>> wrote:
>>>>>
>>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>>> Reports are here:
>>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>>>
>>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>>>>> meaningful content, that is suspicious and needs further investigation.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>
>>>>
>>>>
> 



Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

On 31.07.2020 at 08:05, Andreas Lehmkuehler wrote:
> On 30.07.20 at 20:22, Tilman Hausherr wrote:
>> I've looked at all the files I had highlighted yesterday. All 
>> differences except two are related to the metadata problem.
>>
>> The other two have a problem with spaces, i.e. glyphs not being near 
>> each other.
>>
>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>
>> This doesn't have to be a bug, I've seen many files where the 
>> extraction is better, so whatever change there is may have improved 
>> more things.
> Thanks, for the analysis. IMHO we are good to cut a new release, 
> aren't we?


Yeah we could.

But if the bug gets solved it would be nice to have a new diff output to 
see if anything else gets shown more clearly.

Tilman



>
>
>>
>> Tilman
>>
>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>> Hi,
>>>
>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>
>>> There's something with the XMP metadata extraction. dc:title: is 
>>> empty (or an empty line and maybe spaces) in tika 1.25 but not in 
>>> tika 1.24.
>>>
>>> I thought this could be related to some minor xmpbox changes but 
>>> tika doesn't use it. So I searched and found some changes in 
>>> PDMetadataExtractor.
>>>
>>> I'm not yet sure if that is the cause, although I played around with 
>>> that one.
>>>
>>> If it is, then it is related to
>>>
>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>
>>> Tilman
>>>
>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>> Looks like there may be some issues with Japanese...don't know if 
>>>> this is
>>>> related to your observation?
>>>>
>>>> It feels like when I sort by ascending order of
>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn 
>>>> language pairs
>>>> in the "lost common tokens".
>>>>
>>>> Will look a bit more.
>>>>
>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr 
>>>> <TH...@t-online.de>
>>>> wrote:
>>>>
>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>> Reports are here:
>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz 
>>>>>>
>>>>>
>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>> extraction where "TOP_10_MORE_IN_B" is empty and 
>>>>> "TOP_10_MORE_IN_A" has
>>>>> meaningful content, that is suspicious and needs further 
>>>>> investigation.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>
>>>
>



Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 30.07.20 at 20:22, Tilman Hausherr wrote:
> I've looked at all the files I had highlighted yesterday. All differences except 
> two are related to the metadata problem.
> 
> The other two have a problem with spaces, i.e. glyphs not being near each other.
> 
> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
> 
> This doesn't have to be a bug, I've seen many files where the extraction is 
> better, so whatever change there is may have improved more things.
Thanks for the analysis. IMHO we are good to cut a new release, aren't we?


> 
> Tilman
> 
> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>> Hi,
>>
>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>
>> There's something with the XMP metadata extraction. dc:title: is empty (or an 
>> empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>
>> I thought this could be related to some minor xmpbox changes but tika doesn't 
>> use it. So I searched and found some changes in PDMetadataExtractor.
>>
>> I'm not yet sure if that is the cause, although I played around with that one.
>>
>> If it is, then it is related to
>>
>> https://issues.apache.org/jira/browse/TIKA-3101
>>
>> Tilman
>>
>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>> Looks like there may be some issues with Japanese...don't know if this is
>>> related to your observation?
>>>
>>> It feels like when I sort by ascending order of
>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>>> in the "lost common tokens".
>>>
>>> Will look a bit more.
>>>
>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>>> wrote:
>>>
>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>> Reports are here:
>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>
>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>>> meaningful content, that is suspicious and needs further investigation.
>>>>
>>>> Tilman
>>>>
>>>>
>>
>>
> 



Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

I've looked at all the files I had highlighted yesterday. All 
differences except two are related to the metadata problem.

The other two have a problem with spaces, i.e. glyphs not being near 
each other.

commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ

This doesn't have to be a bug, I've seen many files where the extraction 
is better, so whatever change there is may have improved more things.

Tilman

On 30.07.2020 at 18:05, Tilman Hausherr wrote:
> Hi,
>
> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>
> There's something with the XMP metadata extraction. dc:title: is empty 
> (or an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>
> I thought this could be related to some minor xmpbox changes but tika 
> doesn't use it. So I searched and found some changes in 
> PDMetadataExtractor.
>
> I'm not yet sure if that is the cause, although I played around with 
> that one.
>
> If it is, then it is related to
>
> https://issues.apache.org/jira/browse/TIKA-3101
>
> Tilman
>
> On 30.07.2020 at 12:43, Tim Allison wrote:
>> Looks like there may be some issues with Japanese...don't know if this is
>> related to your observation?
>>
>> It feels like when I sort by ascending order of
>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>> in the "lost common tokens".
>>
>> Will look a bit more.
>>
>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> On 28.07.2020 at 23:51, Tim Allison wrote:
>>>> Reports are here:
>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz 
>>>>
>>>
>>> Thank you. Besides the exceptions, there are a few cases in content
>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>> meaningful content, that is suspicious and needs further investigation.
>>>
>>> Tilman
>>>
>>>

Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

I got a bit closer. IMHO it happens here:

     private static void setNotNull(Property property, String value, Metadata metadata) {
         if (metadata.get(property) == null && !StringUtils.isEmpty(value)) {
             metadata.set(property, value);
         }
     }

If "value" is not empty but consists only of spaces, the problem occurs.

The PDF has a buggy XMP, so you get no title from DublinCore but some
title from Basic. However, this "some title" from Basic is just spaces
(which may or may not be a bug) and shouldn't be used. If it is
skipped, we get the old behavior.
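
A minimal sketch of that skip, not the actual Tika code: it assumes a plain Map as a stand-in for Tika's Metadata, and setNotBlank is a hypothetical name. The only point is that a whitespace-only value is treated like an empty one.

```java
import java.util.HashMap;
import java.util.Map;

public class BlankTitleSketch {

    // Hypothetical stand-in for the setNotNull helper above; a Map replaces Tika's Metadata.
    static void setNotBlank(Map<String, String> metadata, String key, String value) {
        // trim() removes leading and trailing whitespace, so a value made only of spaces is skipped.
        if (metadata.get(key) == null && value != null && !value.trim().isEmpty()) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        setNotBlank(metadata, "dc:title", "   ");        // whitespace-only title: ignored
        setNotBlank(metadata, "dc:title", "Real title"); // a later, meaningful title is still accepted
        System.out.println(metadata.get("dc:title"));    // prints "Real title"
    }
}
```

With the original isEmpty check, the first call would have stored the spaces and blocked the second one.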

Tilman


Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

Hi,

I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf

There's something wrong with the XMP metadata extraction. dc:title is empty
(or an empty line and maybe spaces) in Tika 1.25 but not in Tika 1.24.

I thought this could be related to some minor xmpbox changes, but Tika
doesn't use it. So I searched and found some changes in PDMetadataExtractor.

I'm not yet sure whether that is the cause, although I played around with
that one.

If it is, then it is related to

https://issues.apache.org/jira/browse/TIKA-3101

Tilman

On 30.07.2020 at 12:43, Tim Allison wrote:
> Looks like there may be some issues with Japanese...don't know if this is
> related to your observation?
>
> It feels like when I sort by ascending order of
> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
> in the "lost common tokens".
>
> Will look a bit more.
>
> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>> On 28.07.2020 at 23:51, Tim Allison wrote:
>>> Reports are here:
>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>
>> Thank you. Besides the exceptions, there are a few cases in content
>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>> meaningful content, that is suspicious and needs further investigation.
>>
>> Tilman
>>
>>


Re: PDFBox regression tests?

Posted by Tim Allison <ta...@apache.org>.

It looks like there may be some issues with Japanese. I don't know whether
this is related to your observation.

When I sort by ascending order of NUM_COMMON_TOKENS_DIFF_IN_B, there seem
to be quite a few jpn->jpn language pairs in the "lost common tokens".
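
A rough sketch of that sort, under assumptions: the Row class, its field names and the sample values are made up for illustration, and only NUM_COMMON_TOKENS_DIFF_IN_B and the language pair come from the report. It also assumes that negative values mean tokens were lost in B.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LostTokensSketch {

    // Hypothetical report row; only the diff column and the language pair are modelled.
    static final class Row {
        final String file;
        final String langA;
        final String langB;
        final int numCommonTokensDiffInB;

        Row(String file, String langA, String langB, int numCommonTokensDiffInB) {
            this.file = file;
            this.langA = langA;
            this.langB = langB;
            this.numCommonTokensDiffInB = numCommonTokensDiffInB;
        }

        String languagePair() {
            return langA + "->" + langB;
        }
    }

    // Ascending sort, so the most negative differences (assumed to be the largest losses in B) come first.
    static List<Row> sortByDiffAscending(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort(Comparator.comparingInt(r -> r.numCommonTokensDiffInB));
        return sorted;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a.pdf", "eng", "eng", 12));   // made-up sample values
        rows.add(new Row("b.pdf", "jpn", "jpn", -340));
        rows.add(new Row("c.pdf", "jpn", "jpn", -25));
        for (Row row : sortByDiffAscending(rows)) {
            System.out.println(row.numCommonTokensDiffInB + " " + row.languagePair() + " " + row.file);
        }
    }
}
```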

Will look a bit more.

On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 28.07.2020 at 23:51, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>
>
> Thank you. Besides the exceptions, there are a few cases in content
> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> meaningful content, that is suspicious and needs further investigation.
>
> Tilman
>
>

Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

On 28.07.2020 at 23:51, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz


Thank you. Besides the exceptions, there are a few cases in content
extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
meaningful content. That is suspicious and needs further investigation.
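
One way to pull those cases out of the report, as a sketch only: the Row class and the sample values are invented here, and it assumes both columns can be read as plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class SuspiciousRowsSketch {

    // Hypothetical report row; only the two columns named above are modelled.
    static final class Row {
        final String file;
        final String top10MoreInA;
        final String top10MoreInB;

        Row(String file, String top10MoreInA, String top10MoreInB) {
            this.file = file;
            this.top10MoreInA = top10MoreInA;
            this.top10MoreInB = top10MoreInB;
        }
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    // Keep the files where A has extra tokens but B has none at all.
    static List<String> suspicious(List<Row> rows) {
        List<String> result = new ArrayList<>();
        for (Row row : rows) {
            if (!isBlank(row.top10MoreInA) && isBlank(row.top10MoreInB)) {
                result.add(row.file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row("a.pdf", "title: 3 | report: 2", "")); // made-up sample values
        rows.add(new Row("b.pdf", "", "page: 1"));
        System.out.println(suspicious(rows)); // prints [a.pdf]
    }
}
```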

Tilman

Re: PDFBox regression tests?

Posted by Tim Allison <ta...@apache.org>.

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz

Looks like extraction improved slightly.  I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.

I was able to re-enable our unit test that counts characters and
characters without a Unicode mapping.

I'll look a bit tomorrow, but this looks good to me.

Again, many thanks to Maruan!  The processing speeds were, um, much, much
faster.

Best,

       Tim

On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Yes, please
>
> Thanks in advance!
>
> On 28.07.20 at 12:45, Tim Allison wrote:
> > Y. I can run these today
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> >> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
> >> something ready to start?
> >>
> >> If not, is there anything I can do to help? I'm planning to cut a new
> >> PDFBox release soon.
> >>
> >> Cheers
> >> Andreas
> >>
> >
>
>

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

Yes, please

Thanks in advance!

On 28.07.20 at 12:45, Tim Allison wrote:
> Y. I can run these today
> 
> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Hi,
>>
>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
>> something ready to start?
>>
>> If not, is there anything I can do to help? I'm planning to cut a new
>> PDFBox release soon.
>>
>> Cheers
>> Andreas
>>
>

Re: PDFBox regression tests?

Posted by Tim Allison <ta...@apache.org>.

Y. I can run these today

On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Hi,
>
> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> SNAPSHOT) on our new box? Does anyone have the cycles to prepare
> something ready to start?
>
> If not, is there anything I can do to help? I'm planning to cut a new
> PDFBox release soon.
>
> Cheers
> Andreas
>