Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2020/07/28 21:51:36 UTC

Re: PDFBox regression tests?

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz

Looks like extraction improved slightly.  I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.

I was able to turn back on our unit test that counted characters and
non-unicode mapped characters.

I'll look a bit tomorrow, but this looks good to me.

Again, many thanks to Maruan!  The processing speeds were, um, much, much
faster.

Best,

       Tim

On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Yes, please
>
> Thanks in advance!
>
> Am 28.07.20 um 12:45 schrieb Tim Allison:
> > Y. I can run these today
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the PDFBox regression tests (2.0.20 vs.
> >> SNAPSHOT) on
> >> our new box? Does anyone have the cycles to prepare something ready to
> >> start?
> >>
> >> If not, is there anything I can do to help? I'm planning to cut a new
> >> PDFBox
> >> release soon.
> >>
> >> Cheers
> >> Andreas
> >>
> >
>
>

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 28.07.20 um 23:51 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> 
> Looks like extraction improved slightly.  I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
> 
> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
> 
> I'll look a bit tomorrow, but this looks good to me.
Do you have some time to rerun those tests using the latest SNAPSHOT?

Andreas

> 
> Again, many thanks to Maruan!  The processing speeds were, um, much, much
> faster.
> 
> Best,
> 
>         Tim
> 
> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Yes, please
>>
>> Thanks in advance!
>>
>> Am 28.07.20 um 12:45 schrieb Tim Allison:
>>> Y. I can run these today
>>>
>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>> SNAPSHOT) on
>>>> our new box? Does anyone have the cycles to prepare something ready to
>>>> start?
>>>>
>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>> PDFBox
>>>> release soon.
>>>>
>>>> Cheers
>>>> Andreas
>>>>
>>>
>>
>>
> 


Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 29.07.20 um 17:33 schrieb Andreas Lehmkuehler:
> Am 28.07.20 um 23:51 schrieb Tim Allison:
>> Reports are here:
>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>
>> Looks like extraction improved slightly.  I found a bug at the Tika level
>> that is creating a few more exceptions (will fix soon), but this is not a
>> problem for PDFBox.
> There are some new PDFBox-related exceptions. 3 of them are already in 2.0.21 
> (the AES and the CodespaceRange issue). I'm investigating the others.
Correction: those 3 issues are already in 2.0.20.

The others are solved now. There was a problem with the ordering of offsets in 
an object stream. According to the spec those shall be sorted, but they aren't, 
see PDFBOX-4927 for details.
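
To illustrate the kind of problem described above, here is a minimal sketch of a tolerant reader for the header of an object stream, which lists pairs of object number and byte offset. It sorts the pairs by offset itself instead of relying on the file to do so. The class ObjectStreamIndex, its Entry type and the sample header are made up for this example; this is not PDFBox code and not necessarily how PDFBOX-4927 was fixed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper for illustration; this is not PDFBox code. */
final class ObjectStreamIndex {

    /** One header pair: object number and byte offset relative to /First. */
    static final class Entry {
        final long objectNumber;
        final int offset;

        Entry(long objectNumber, int offset) {
            this.objectNumber = objectNumber;
            this.offset = offset;
        }
    }

    /**
     * Parses "objNum offset objNum offset ..." and returns the entries
     * ordered by offset, whatever order the file listed them in.
     */
    static List<Entry> parseSortedByOffset(String header, int count) {
        String[] tokens = header.trim().split("\\s+");
        if (tokens.length < 2 * count) {
            throw new IllegalArgumentException("expected " + count + " pairs");
        }
        List<Entry> entries = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            entries.add(new Entry(Long.parseLong(tokens[2 * i]),
                                  Integer.parseInt(tokens[2 * i + 1])));
        }
        // Sort explicitly instead of trusting the order in the file.
        entries.sort(Comparator.comparingInt((Entry e) -> e.offset));
        return entries;
    }

    public static void main(String[] args) {
        // Made-up header in which the second pair is out of order.
        for (Entry e : parseSortedByOffset("11 547 12 0 13 665", 3)) {
            System.out.println(e.objectNumber + " @ " + e.offset);
        }
    }
}
```

With the entries in offset order, a reader that walks the stream sequentially can take the next entry's offset as the end of the current object, which is where unsorted input would otherwise go wrong.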

>> I was able to turn back on our unit test that counted characters and
>> non-unicode mapped characters.
>>
>> I'll look a bit tomorrow, but this looks good to me.
> @Tim thanks again for running those tests. I've stumbled upon one minor glitch 
> in your reports. There are two sheets about parse time. The overall report 
> parse_time_millis_by_mime_compared.xlsx states that the pdf parsing time has 
> decreased to 88%, but when I look at the details report 
> parse_time_millis_details.xlsx, it looks like the parsing time increased.
> 
> Am I mistaken or is there a glitch in your report (swapped columns)?
> 
> 
>>
>> Again, many thanks to Maruan!  The processing speeds were, um, much, much
>> faster.
>>
>> Best,
>>
>>         Tim
>>
>> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
>> wrote:
>>
>>> Yes, please
>>>
>>> Thanks in advance!
>>>
>>> Am 28.07.20 um 12:45 schrieb Tim Allison:
>>>> Y. I can run these today
>>>>
>>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>>> SNAPSHOT) on
>>>>> our new box? Does anyone have the cycles to prepare something ready to
>>>>> start?
>>>>>
>>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>>> PDFBox
>>>>> release soon.
>>>>>
>>>>> Cheers
>>>>> Andreas
>>>>>
>>>>
>>>
>>>
>>
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> For additional commands, e-mail: dev-help@pdfbox.apache.org
> 



Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 28.07.20 um 23:51 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> 
> Looks like extraction improved slightly.  I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
There are some new PDFBox-related exceptions. 3 of them are already in 2.0.21 
(the AES and the CodespaceRange issue). I'm investigating the others.


> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
> 
> I'll look a bit tomorrow, but this looks good to me.
@Tim thanks again for running those tests. I've stumbled upon one minor glitch 
in your reports. There are two sheets about parse time. The overall report 
parse_time_millis_by_mime_compared.xlsx states that the pdf parsing time has 
decreased to 88%, but when I look at the details report 
parse_time_millis_details.xlsx, it looks like the parsing time increased.

Am I mistaken or is there a glitch in your report (swapped columns)?
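
One way to cross-check the two sheets is to recompute the overall figure from the per-file values. The sketch below is only an illustration: the class name ParseTimeCheck and the numbers are made up, and it assumes the "88%" is the total time of the new run as a percentage of the old run. If the two columns were swapped somewhere, the same data would give the reciprocal, i.e. a value above 100%.

```java
/** Hypothetical cross-check; names and sample values are made up. */
final class ParseTimeCheck {

    /** Returns the total of candidate as a percentage of the total of baseline. */
    static double percentOfBaseline(long[] baseline, long[] candidate) {
        long sumBaseline = 0;
        long sumCandidate = 0;
        for (long millis : baseline) {
            sumBaseline += millis;
        }
        for (long millis : candidate) {
            sumCandidate += millis;
        }
        if (sumBaseline == 0) {
            throw new IllegalArgumentException("baseline total is zero");
        }
        return 100.0 * sumCandidate / sumBaseline;
    }

    public static void main(String[] args) {
        long[] release = {100, 250, 150};   // made-up per-file millis, old version
        long[] snapshot = {90, 220, 130};   // made-up per-file millis, new version

        // Intended reading: new run relative to old run.
        System.out.println(percentOfBaseline(release, snapshot));
        // Same data with the columns swapped: old run relative to new run.
        System.out.println(percentOfBaseline(snapshot, release));
    }
}
```

If the summary sheet and the details sheet disagree in direction, comparing both readings against the raw totals shows which of the two has the columns the wrong way round.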


> 
> Again, many thanks to Maruan!  The processing speeds were, um, much, much
> faster.
> 
> Best,
> 
>         Tim
> 
> On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Yes, please
>>
>> Thanks in advance!
>>
>> Am 28.07.20 um 12:45 schrieb Tim Allison:
>>> Y. I can run these today
>>>
>>> On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> is there any chance to run the PDFBox regression tests (2.0.20 vs.
>>>> SNAPSHOT) on
>>>> our new box? Does anyone have the cycles to prepare something ready to
>>>> start?
>>>>
>>>> If not, is there anything I can do to help? I'm planning to cut a new
>>>> PDFBox
>>>> release soon.
>>>>
>>>> Cheers
>>>> Andreas
>>>>
>>>
>>
>>
> 



Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 12.08.20 um 23:21 schrieb Tim Allison:
> All,
>    Apologies for my delay...
> 
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
Thanks!

> I haven't had a chance to look at the reports yet. :(
I had a look at the new exceptions and both are already in 2.0.20. I can't see 
any showstopper here.

Andreas
> I tried to update the instructions for running the process on the vm.
> Please let me know if you have any questions, or if I need to make
> improvements.
> 
> Thank you.
> 
>     Best,
> 
>                Tim
> 
> On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
> 
>> Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
>>> Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
>>>> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>>>>> I've looked at all the files I had highlighted yesterday. All differences
>>>>> except two are related to the metadata problem.
>>>>>
>>>>> The other two have a problem with spaces, i.e. glyphs not being near each other.
>>>>>
>>>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>>>
>>>>> This doesn't have to be a bug, I've seen many files where the extraction is
>>>>> better, so whatever change there is may have improved more things.
>>>> Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
>>>
>>>
>>> Yeah we could.
>>>
>>> But if the bug gets solved it would be nice to have a new diff output to see if
>>> anything else gets shown more clearly.
>> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
>> else we have to wait for before we run the tests again, maybe some tika fix?
>>
>> Andreas
>>
>>> Tilman
>>>
>>>
>>>
>>>>
>>>>
>>>>>
>>>>> Tilman
>>>>>
>>>>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>>>>> Hi,
>>>>>>
>>>>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>>>>
>>>>>> There's something with the XMP metadata extraction. dc:title: is empty (or
>>>>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>>>>>
>>>>>> I thought this could be related to some minor xmpbox changes but tika
>>>>>> doesn't use it. So I searched and found some changes in PDMetadataExtractor.
>>>>>>
>>>>>> I'm not yet sure if that is the cause, although I played around with that one.
>>>>>>
>>>>>> If it is, then it is related to
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>>>>> Looks like there may be some issues with Japanese...don't know if this is
>>>>>>> related to your observation?
>>>>>>>
>>>>>>> It feels like when I sort by ascending order of
>>>>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>>>>>>> in the "lost common tokens".
>>>>>>>
>>>>>>> Will look a bit more.
>>>>>>>
>>>>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>>>>> Reports are here:
>>>>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>>>>>
>>>>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>>>>>>> meaningful content, that is suspicious and needs further investigation.
>>>>>>>>
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>


Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 12.08.2020 um 23:21 schrieb Tim Allison:
> All,
>    Apologies for my delay...
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
>
> I haven't had a chance to look at the reports yet. :(
>
> I tried to update the instructions for running the process on the vm.
> Please let me know if you have any questions, or if I need to make
> improvements.


Thanks, it looks good that there are no show stoppers, i.e. thumbs up 
from me.

But the tika problem is still there, although less often. It is gone for the 
file I mentioned (or the file wasn't in the test), but not for others, e.g.

commoncrawl3/6V/6VTB5IUKXBFA3JZPJBUPVSRY7L56K6LE

commoncrawl3/5I/5I6STZEO5W25GPETYGLCDLIB6OKQXCIG

Tilman



>
> Thank you.
>
>     Best,
>
>                Tim
>
> On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
> wrote:
>
>> Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
>>> Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
>>>> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>>>>> I've looked at all the files I had highlighted yesterday. All differences
>>>>> except two are related to the metadata problem.
>>>>>
>>>>> The other two have a problem with spaces, i.e. glyphs not being near each other.
>>>>>
>>>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>>>
>>>>> This doesn't have to be a bug, I've seen many files where the extraction is
>>>>> better, so whatever change there is may have improved more things.
>>>> Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
>>>
>>> Yeah we could.
>>>
>>> But if the bug gets solved it would be nice to have a new diff output to see if
>>> anything else gets shown more clearly.
>> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
>> else we have to wait for before we run the tests again, maybe some tika fix?
>>
>> Andreas
>>
>>> Tilman
>>>
>>>
>>>
>>>>
>>>>> Tilman
>>>>>
>>>>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>>>>> Hi,
>>>>>>
>>>>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>>>>
>>>>>> There's something with the XMP metadata extraction. dc:title: is empty (or
>>>>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>>>>>
>>>>>> I thought this could be related to some minor xmpbox changes but tika
>>>>>> doesn't use it. So I searched and found some changes in PDMetadataExtractor.
>>>>>>
>>>>>> I'm not yet sure if that is the cause, although I played around with that one.
>>>>>>
>>>>>> If it is, then it is related to
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>>>>> Looks like there may be some issues with Japanese...don't know if this is
>>>>>>> related to your observation?
>>>>>>>
>>>>>>> It feels like when I sort by ascending order of
>>>>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>>>>>>> in the "lost common tokens".
>>>>>>>
>>>>>>> Will look a bit more.
>>>>>>>
>>>>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <THausherr@t-online.de>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>>>>> Reports are here:
>>>>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>>>>>
>>>>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>>>>>>> meaningful content, that is suspicious and needs further investigation.
>>>>>>>> Tilman
>>>>>>>>
>>>>>>>>
>>>>>>


Re: PDFBox regression tests?

Posted by Tim Allison <ta...@apache.org>.
All,
  Apologies for my delay...

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz

I haven't had a chance to look at the reports yet. :(

I tried to update the instructions for running the process on the vm.
Please let me know if you have any questions, or if I need to make
improvements.

Thank you.

   Best,

              Tim

On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
> > Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
> >> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
> >>> I've looked at all the files I had highlighted yesterday. All differences
> >>> except two are related to the metadata problem.
> >>>
> >>> The other two have a problem with spaces, i.e. glyphs not being near each other.
> >>>
> >>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> >>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
> >>>
> >>> This doesn't have to be a bug, I've seen many files where the extraction is
> >>> better, so whatever change there is may have improved more things.
> >> Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
> >
> >
> > Yeah we could.
> >
> > But if the bug gets solved it would be nice to have a new diff output to see if
> > anything else gets shown more clearly.
> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
> else we have to wait for before we run the tests again, maybe some tika fix?
>
> Andreas
>
> > Tilman
> >
> >
> >
> >>
> >>
> >>>
> >>> Tilman
> >>>
> >>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
> >>>> Hi,
> >>>>
> >>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
> >>>>
> >>>> There's something with the XMP metadata extraction. dc:title: is empty (or
> >>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
> >>>>
> >>>> I thought this could be related to some minor xmpbox changes but tika
> >>>> doesn't use it. So I searched and found some changes in PDMetadataExtractor.
> >>>>
> >>>> I'm not yet sure if that is the cause, although I played around with that one.
> >>>>
> >>>> If it is, then it is related to
> >>>>
> >>>> https://issues.apache.org/jira/browse/TIKA-3101
> >>>>
> >>>> Tilman
> >>>>
> >>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
> >>>>> Looks like there may be some issues with Japanese...don't know if this is
> >>>>> related to your observation?
> >>>>>
> >>>>> It feels like when I sort by ascending order of
> >>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
> >>>>> in the "lost common tokens".
> >>>>>
> >>>>> Will look a bit more.
> >>>>>
> >>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <THausherr@t-online.de>
> >>>>> wrote:
> >>>>>
> >>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
> >>>>>>> Reports are here:
> >>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> >>>>>>
> >>>>>> Thank you. Besides the exceptions, there are a few cases in content
> >>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> >>>>>> meaningful content, that is suspicious and needs further investigation.
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>>
> >>>>
> >>>>

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
> Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
>> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>>> I've looked at all the files I had highlighted yesterday. All differences 
>>> except two are related to the metadata problem.
>>>
>>> The other two have a problem with spaces, i.e. glyphs not being near each other.
>>>
>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>
>>> This doesn't have to be a bug, I've seen many files where the extraction is 
>>> better, so whatever change there is may have improved more things.
>> Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
> 
> 
> Yeah we could.
> 
> But if the bug gets solved it would be nice to have a new diff output to see if 
> anything else gets shown more clearly.
I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything 
else we have to wait for before we run the tests again, maybe some tika fix?

Andreas

> Tilman
> 
> 
> 
>>
>>
>>>
>>> Tilman
>>>
>>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>>> Hi,
>>>>
>>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>>
>>>> There's something with the XMP metadata extraction. dc:title: is empty (or 
>>>> an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>>>
>>>> I thought this could be related to some minor xmpbox changes but tika 
>>>> doesn't use it. So I searched and found some changes in PDMetadataExtractor.
>>>>
>>>> I'm not yet sure if that is the cause, although I played around with that one.
>>>>
>>>> If it is, then it is related to
>>>>
>>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>>
>>>> Tilman
>>>>
>>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>>> Looks like there may be some issues with Japanese...don't know if this is
>>>>> related to your observation?
>>>>>
>>>>> It feels like when I sort by ascending order of
>>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>>>>> in the "lost common tokens".
>>>>>
>>>>> Will look a bit more.
>>>>>
>>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>>>>> wrote:
>>>>>
>>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>>> Reports are here:
>>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>>>
>>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>>>>> meaningful content, that is suspicious and needs further investigation.
>>>>>>
>>>>>> Tilman
>>>>>>
>>>>>>
>>>>
>>>>


Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>> I've looked at all the files I had highlighted yesterday. All 
>> differences except two are related to the metadata problem.
>>
>> The other two have a problem with spaces, i.e. glyphs not being near 
>> each other.
>>
>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>
>> This doesn't have to be a bug, I've seen many files where the 
>> extraction is better, so whatever change there is may have improved 
>> more things.
> Thanks for the analysis. IMHO we are good to cut a new release, 
> aren't we?


Yeah we could.

But if the bug gets solved it would be nice to have a new diff output to 
see if anything else gets shown more clearly.

Tilman



>
>
>>
>> Tilman
>>
>> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>>> Hi,
>>>
>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>>
>>> There's something with the XMP metadata extraction. dc:title: is 
>>> empty (or an empty line and maybe spaces) in tika 1.25 but not in 
>>> tika 1.24.
>>>
>>> I thought this could be related to some minor xmpbox changes but 
>>> tika doesn't use it. So I searched and found some changes in 
>>> PDMetadataExtractor.
>>>
>>> I'm not yet sure if that is the cause, although I played around with 
>>> that one.
>>>
>>> If it is, then it is related to
>>>
>>> https://issues.apache.org/jira/browse/TIKA-3101
>>>
>>> Tilman
>>>
>>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>>> Looks like there may be some issues with Japanese...don't know if 
>>>> this is
>>>> related to your observation?
>>>>
>>>> It feels like when I sort by ascending order of
>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn 
>>>> language pairs
>>>> in the "lost common tokens".
>>>>
>>>> Will look a bit more.
>>>>
>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr 
>>>> <TH...@t-online.de>
>>>> wrote:
>>>>
>>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>>> Reports are here:
>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz 
>>>>>>
>>>>>
>>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>>> extraction where "TOP_10_MORE_IN_B" is empty and 
>>>>> "TOP_10_MORE_IN_A" has
>>>>> meaningful content, that is suspicious and needs further 
>>>>> investigation.
>>>>>
>>>>> Tilman
>>>>>
>>>>>
>>>
>>>


Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
> I've looked at all the files I had highlighted yesterday. All differences except 
> two are related to the metadata problem.
> 
> The other two have a problem with spaces, i.e. glyphs not being near each other.
> 
> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
> 
> This doesn't have to be a bug, I've seen many files where the extraction is 
> better, so whatever change there is may have improved more things.
Thanks for the analysis. IMHO we are good to cut a new release, aren't we?


> 
> Tilman
> 
> Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
>> Hi,
>>
>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>>
>> There's something with the XMP metadata extraction. dc:title: is empty (or an 
>> empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>>
>> I thought this could be related to some minor xmpbox changes but tika doesn't 
>> use it. So I searched and found some changes in PDMetadataExtractor.
>>
>> I'm not yet sure if that is the cause, although I played around with that one.
>>
>> If it is, then it is related to
>>
>> https://issues.apache.org/jira/browse/TIKA-3101
>>
>> Tilman
>>
>> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>>> Looks like there may be some issues with Japanese...don't know if this is
>>> related to your observation?
>>>
>>> It feels like when I sort by ascending order of
>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>>> in the "lost common tokens".
>>>
>>> Will look a bit more.
>>>
>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>>> wrote:
>>>
>>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>>> Reports are here:
>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>
>>>> Thank you. Besides the exceptions, there are a few cases in content
>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>>> meaningful content, that is suspicious and needs further investigation.
>>>>
>>>> Tilman
>>>>
>>>>
>>
>>


Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
I've looked at all the files I had highlighted yesterday. All 
differences except two are related to the metadata problem.

The other two have a problem with spaces, i.e. glyphs not being near 
each other.

commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ

This doesn't have to be a bug. I've seen many files where the extraction 
is better, so whatever changed may have improved more things.

Tilman

On 30.07.2020 at 18:05, Tilman Hausherr wrote:
> Hi,
>
> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>
> There's something with the XMP metadata extraction. dc:title: is empty 
> (or an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>
> I thought this could be related to some minor xmpbox changes but tika 
> doesn't use it. So I searched and found some changes in 
> PDMetadataExtractor.
>
> I'm not yet sure if that is the cause, although I played around with 
> that one.
>
> If it is, then it is related to
>
> https://issues.apache.org/jira/browse/TIKA-3101
>
> Tilman
>
> On 30.07.2020 at 12:43, Tim Allison wrote:
>> Looks like there may be some issues with Japanese...don't know if this is
>> related to your observation?
>>
>> It feels like when I sort by ascending order of
>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>> in the "lost common tokens".
>>
>> Will look a bit more.
>>
>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> On 28.07.2020 at 23:51, Tim Allison wrote:
>>>> Reports are here:
>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz 
>>>>
>>>
>>> Thank you. Besides the exceptions, there are a few cases in content
>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>> meaningful content, that is suspicious and needs further investigation.
>>>
>>> Tilman
>>>
>>>


Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
I got a bit closer. IMHO it happens here:

     private static void setNotNull(Property property, String value, Metadata metadata) {
         if (metadata.get(property) == null && !StringUtils.isEmpty(value)) {
             metadata.set(property, value);
         }
     }

If "value" is not empty but consists only of spaces, it passes the 
isEmpty check and gets set, and that is when the problem happens.

The PDF has a buggy XMP so you get no title from DublinCore but some 
title from Basic. However this "some title" from Basic is just spaces 
(which may or may not be a bug) and shouldn't be used. If this is 
skipped then we have the old behavior.
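
A minimal sketch of the skip described above, not the actual Tika fix. 
It assumes a plain Map as a stand-in for Tika's Metadata class, and the 
names BlankAwareSetter, isBlank and setNotBlank are hypothetical. The 
only change compared to the method quoted above is that a value made up 
of whitespace is treated like an empty one.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in: a Map replaces Tika's Metadata so the sketch runs on its own.
public class BlankAwareSetter {

    /** True when the value is null, empty, or consists only of whitespace. */
    static boolean isBlank(String value) {
        // trim() strips characters <= U+0020; other Unicode spaces are not covered here
        return value == null || value.trim().isEmpty();
    }

    /** Sets the key only if it has no value yet and the candidate is not blank. */
    static void setNotBlank(String key, String value, Map<String, String> metadata) {
        if (metadata.get(key) == null && !isBlank(value)) {
            metadata.put(key, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();

        // Whitespace-only title, as in the fallback case described above: skipped
        setNotBlank("dc:title", "   ", metadata);
        System.out.println(metadata.containsKey("dc:title"));

        // A real title is stored
        setNotBlank("dc:title", "Real title", metadata);
        System.out.println(metadata.get("dc:title"));

        // A later value does not overwrite the existing one
        setNotBlank("dc:title", "Later title", metadata);
        System.out.println(metadata.get("dc:title"));
    }
}
```

Because trim() only removes characters up to U+0020, a title made of 
non-breaking spaces would still get through; whether that matters for 
these files is not known from the reports.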

Tilman

On 30.07.2020 at 18:05, Tilman Hausherr wrote:
> Hi,
>
> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>
> There's something with the XMP metadata extraction. dc:title: is empty 
> (or an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>
> I thought this could be related to some minor xmpbox changes but tika 
> doesn't use it. So I searched and found some changes in 
> PDMetadataExtractor.
>
> I'm not yet sure if that is the cause, although I played around with 
> that one.
>
> If it is, then it is related to
>
> https://issues.apache.org/jira/browse/TIKA-3101
>
> Tilman
>
> On 30.07.2020 at 12:43, Tim Allison wrote:
>> Looks like there may be some issues with Japanese...don't know if this is
>> related to your observation?
>>
>> It feels like when I sort by ascending order of
>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
>> in the "lost common tokens".
>>
>> Will look a bit more.
>>
>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> On 28.07.2020 at 23:51, Tim Allison wrote:
>>>> Reports are here:
>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz 
>>>>
>>>
>>> Thank you. Besides the exceptions, there are a few cases in content
>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>> meaningful content, that is suspicious and needs further investigation.
>>>
>>> Tilman
>>>
>>>


Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,

I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf

There's something wrong with the XMP metadata extraction. dc:title is empty 
(or an empty line and maybe spaces) in Tika 1.25 but not in Tika 1.24.

I thought this could be related to some minor xmpbox changes but tika 
doesn't use it. So I searched and found some changes in PDMetadataExtractor.

I'm not yet sure if that is the cause, although I played around with 
that one.

If it is, then it is related to

https://issues.apache.org/jira/browse/TIKA-3101

Tilman

On 30.07.2020 at 12:43, Tim Allison wrote:
> Looks like there may be some issues with Japanese...don't know if this is
> related to your observation?
>
> It feels like when I sort by ascending order of
> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
> in the "lost common tokens".
>
> Will look a bit more.
>
> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>> On 28.07.2020 at 23:51, Tim Allison wrote:
>>> Reports are here:
>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>
>> Thank you. Besides the exceptions, there are a few cases in content
>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>> meaningful content, that is suspicious and needs further investigation.
>>
>> Tilman
>>
>>


Re: PDFBox regression tests?

Posted by Tim Allison <ta...@apache.org>.
Looks like there may be some issues with Japanese...don't know if this is
related to your observation?

It feels like when I sort by ascending order of
NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
in the "lost common tokens".

Will look a bit more.

On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
wrote:

> On 28.07.2020 at 23:51, Tim Allison wrote:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>
>
> Thank you. Besides the exceptions, there are a few cases in content
> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> meaningful content, that is suspicious and needs further investigation.
>
> Tilman
>
>

Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.
On 28.07.2020 at 23:51, Tim Allison wrote:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz


Thank you. Besides the exceptions, there are a few cases in content 
extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has 
meaningful content. That is suspicious and needs further investigation.
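
The check described above could be sketched roughly as follows. This is 
only an illustration: the Row class, the file names and the sample 
values are made up, and it does not read the actual report files. Only 
the two column names come from the reports.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SuspiciousRows {

    /** Hypothetical in-memory row; only the two column names come from the reports. */
    static final class Row {
        final String file;
        final String top10MoreInA; // value of TOP_10_MORE_IN_A
        final String top10MoreInB; // value of TOP_10_MORE_IN_B

        Row(String file, String top10MoreInA, String top10MoreInB) {
            this.file = file;
            this.top10MoreInA = top10MoreInA;
            this.top10MoreInB = top10MoreInB;
        }
    }

    static boolean isBlank(String s) {
        return s == null || s.trim().isEmpty();
    }

    /** Returns the files where A has extra tokens while B has none. */
    static List<String> suspiciousFiles(List<Row> rows) {
        List<String> result = new ArrayList<>();
        for (Row row : rows) {
            if (!isBlank(row.top10MoreInA) && isBlank(row.top10MoreInB)) {
                result.add(row.file);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the report
        List<Row> rows = Arrays.asList(
                new Row("a.pdf", "title: 3 | report: 2", ""),
                new Row("b.pdf", "foo: 1", "bar: 1"),
                new Row("c.pdf", "", ""));
        System.out.println(suspiciousFiles(rows));
    }
}
```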

Tilman