Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2020/07/28 21:51:36 UTC
Re: PDFBox regression tests?
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Looks like extraction improved slightly. I found a bug at the Tika level
that is creating a few more exceptions (will fix soon), but this is not a
problem for PDFBox.
I was able to turn back on our unit test that counted characters and
non-unicode mapped characters.
I'll look a bit tomorrow, but this looks good to me.
Again, many thanks to Maruan! The processing speeds were, um, much, much
faster.
Best,
Tim
On Tue, Jul 28, 2020 at 10:56 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:
> Yes, please
>
> Thanks in advance!
>
> Am 28.07.20 um 12:45 schrieb Tim Allison:
> > Y. I can run these today
> >
> > On Tue, Jul 28, 2020 at 2:58 AM Andreas Lehmkuehler <an...@lehmi.de>
> > wrote:
> >
> >> Hi,
> >>
> >> is there any chance to run the PDFBox regression tests (2.0.20 vs. SNAPSHOT)
> >> on our new box? Does anyone have the cycles to prepare something ready to
> >> start?
> >>
> >> If not, is there anything I can do to help? I'm planning to cut a new PDFBox
> >> release soon.
> >>
> >> Cheers
> >> Andreas
> >>
> >
>
>
Re: PDFBox regression tests?
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 28.07.20 um 23:51 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>
> Looks like extraction improved slightly. I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
>
> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
>
> I'll look a bit tomorrow, but this looks good to me.
Do you have some time to rerun those tests using the latest SNAPSHOT?
Andreas
Re: PDFBox regression tests?
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 29.07.20 um 17:33 schrieb Andreas Lehmkuehler:
> Am 28.07.20 um 23:51 schrieb Tim Allison:
>> Reports are here:
>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>
>> Looks like extraction improved slightly. I found a bug at the Tika level
>> that is creating a few more exceptions (will fix soon), but this is not a
>> problem for PDFBox.
> There are some new PDFBox-related exceptions. 3 of them are already in 2.0.21
> (the AES and the CodespaceRange issue). I'm investigating the others.
Those 3 issues are already in 2.0.20.
The others are solved now. There was a problem with the ordering of offsets in
an object stream. According to the spec those shall be sorted, but they aren't,
see PDFBOX-4927 for details.
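For illustration only, here is a minimal Java sketch of the idea: read the
(object number, offset) pairs from an object stream header and sort them by
offset before use, so that out-of-order entries no longer matter. The class
ObjectStreamOffsets, its Entry type and the whitespace-separated header string
are assumptions made up for this example; this is not the actual PDFBox code or
the fix committed for PDFBOX-4927.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ObjectStreamOffsets {

    /** One (object number, byte offset) pair from an object stream header (hypothetical type). */
    static final class Entry {
        final long objectNumber;
        final int offset;

        Entry(long objectNumber, int offset) {
            this.objectNumber = objectNumber;
            this.offset = offset;
        }
    }

    /**
     * Parses "objNum offset objNum offset ..." and returns the entries sorted
     * by ascending offset, regardless of the order they were written in.
     */
    static List<Entry> parseSortedByOffset(String header) {
        String[] tokens = header.trim().split("\\s+");
        if (tokens.length % 2 != 0) {
            throw new IllegalArgumentException("expected pairs of integers");
        }
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < tokens.length; i += 2) {
            entries.add(new Entry(Long.parseLong(tokens[i]), Integer.parseInt(tokens[i + 1])));
        }
        // Tolerate files that violate the "increasing order" requirement.
        entries.sort(Comparator.comparingInt(e -> e.offset));
        return entries;
    }

    public static void main(String[] args) {
        // Offsets 40, 0, 17 are deliberately out of order in this sample header.
        for (Entry e : parseSortedByOffset("12 40 11 0 13 17")) {
            System.out.println(e.objectNumber + " @ " + e.offset);
        }
    }
}
```

Sorting a copy of the header entries keeps the reader independent of how the
producing software ordered them, at the cost of one extra pass over the pairs.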
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: dev-help@pdfbox.apache.org
Re: PDFBox regression tests?
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 28.07.20 um 23:51 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>
> Looks like extraction improved slightly. I found a bug at the Tika level
> that is creating a few more exceptions (will fix soon), but this is not a
> problem for PDFBox.
There are some new PDFBox-related exceptions. 3 of them are already in 2.0.21
(the AES and the CodespaceRange issue). I'm investigating the others.
> I was able to turn back on our unit test that counted characters and
> non-unicode mapped characters.
>
> I'll look a bit tomorrow, but this looks good to me.
@Tim thanks again for running those tests. I've stumbled upon one minor glitch
in your reports. There are two sheets about parse time. The overall report
parse_time_millis_by_mime_compared.xlsx states that the PDF parsing time has
decreased to 88%, but when I look at the details report
parse_time_millis_details.xlsx, it looks like the parsing time has increased.
Am I mistaken or is there a glitch in your report (swapped columns)?
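As a rough illustration of why swapped columns would flip the reading, the Java
sketch below computes the new parse time as a percentage of the old one. The
class name ParseTimeRatio and the millisecond values are made up for this
example and are not taken from the reports.

```java
public class ParseTimeRatio {

    /** Returns the candidate time as a percentage of the baseline time. */
    static double percentOfBaseline(long baselineMillis, long candidateMillis) {
        if (baselineMillis <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return 100.0 * candidateMillis / baselineMillis;
    }

    public static void main(String[] args) {
        long oldMillis = 1000; // hypothetical total for the older build
        long newMillis = 880;  // hypothetical total for the newer build

        // Columns in the intended order: the newer build needs 88% of the old time.
        System.out.println(percentOfBaseline(oldMillis, newMillis));

        // Columns swapped: the same data now reads as roughly 114%, i.e. slower.
        System.out.println(Math.round(percentOfBaseline(newMillis, oldMillis)));
    }
}
```

If one sheet divides new by old and the other old by new, the two would
disagree in exactly this way even though the underlying timings are the same.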
Re: PDFBox regression tests?
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 12.08.20 um 23:21 schrieb Tim Allison:
> All,
> Apologies for my delay...
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
Thanks!
> I haven't had a chance to look at the reports yet. :(
I had a look at the new exceptions and both are already in 2.0.20. I can't see
any showstopper here.
Andreas
Re: PDFBox regression tests?
Posted by Tilman Hausherr <TH...@t-online.de>.
Am 12.08.2020 um 23:21 schrieb Tim Allison:
> All,
> Apologies for my delay...
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
>
> I haven't had a chance to look at the reports yet. :(
>
> I tried to update the instructions for running the process on the vm.
> Please let me know if you have any questions, or if I need to make
> improvements.
Thanks, it looks good that there are no show stoppers, i.e. thumbs up
from me.
But the Tika problem is still there, although it shows up less often. It is
gone for the file I mentioned (or the file wasn't in the test), but not for
others, e.g.
commoncrawl3/6V/6VTB5IUKXBFA3JZPJBUPVSRY7L56K6LE
commoncrawl3/5I/5I6STZEO5W25GPETYGLCDLIB6OKQXCIG
Tilman
Re: PDFBox regression tests?
Posted by Tim Allison <ta...@apache.org>.
All,
Apologies for my delay...
Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
I haven't had a chance to look at the reports yet. :(
I tried to update the instructions for running the process on the vm.
Please let me know if you have any questions, or if I need to make
improvements.
Thank you.
Best,
Tim
Re: PDFBox regression tests?
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 31.07.20 um 08:27 schrieb Tilman Hausherr:
> Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
>> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>>> I've looked at all the files I had highlighted yesterday. All differences
>>> except two are related to the metadata problem.
>>>
>>> The other two have a problem with spaces, i.e. glyphs not being near each other.
>>>
>>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>>
>>> This doesn't have to be a bug, I've seen many files where the extraction is
>>> better, so whatever change there is may have improved more things.
>> Thanks, for the analysis. IMHO we are good to cut a new release, aren't we?
>
>
> Yeah we could.
>
> But if the bug gets solved it would be nice to have a new diff output to see if
> anything else gets shown more clearly.
I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
else we have to wait for before we run the tests again, maybe some Tika fix?
Andreas
Re: PDFBox regression tests?
Posted by Tilman Hausherr <TH...@t-online.de>.
Am 31.07.2020 um 08:05 schrieb Andreas Lehmkuehler:
> Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
>> I've looked at all the files I had highlighted yesterday. All
>> differences except two are related to the metadata problem.
>>
>> The other two have a problem with spaces, i.e. glyphs not being near
>> each other.
>>
>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>>
>> This doesn't have to be a bug, I've seen many files where the
>> extraction is better, so whatever change there is may have improved
>> more things.
> Thanks, for the analysis. IMHO we are good to cut a new release,
> aren't we?
Yeah we could.
But if the bug gets solved it would be nice to have a new diff output to
see if anything else gets shown more clearly.
Tilman
Re: PDFBox regression tests?
Posted by Andreas Lehmkuehler <an...@lehmi.de>.
Am 30.07.20 um 20:22 schrieb Tilman Hausherr:
> I've looked at all the files I had highlighted yesterday. All differences except
> two are related to the metadata problem.
>
> The other two have a problem with spaces, i.e. glyphs not being near each other.
>
> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
>
> This doesn't have to be a bug, I've seen many files where the extraction is
> better, so whatever change there is may have improved more things.
Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
Re: PDFBox regression tests?
Posted by Tilman Hausherr <TH...@t-online.de>.
I've looked at all the files I had highlighted yesterday. All
differences except two are related to the metadata problem.
The other two have a problem with spaces, i.e. glyphs not being near
each other.
commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
This doesn't have to be a bug, I've seen many files where the extraction
is better, so whatever change there is may have improved more things.
Tilman
Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
> Hi,
>
> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>
> There's something with the XMP metadata extraction. dc:title: is empty
> (or an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>
> I thought this could be related to some minor xmpbox changes but tika
> doesn't use it. So I searched and found some changes in
> PDMetadataExtractor.
>
> I'm not yet sure if that is the cause, although I played around with
> that one.
>
> If it is, then it is related to
>
> https://issues.apache.org/jira/browse/TIKA-3101
>
> Tilman
>
> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>> Looks like there may be some issues with Japanese...don't know if
>> this is
>> related to your observation?
>>
>> It feels like when I sort by ascending order of
>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
>> pairs
>> in the "lost common tokens".
>>
>> Will look a bit more.
>>
>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>> Reports are here:
>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>
>>>
>>> Thank you. Besides the exceptions, there are a few cases in content
>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>> meaningful content, that is suspicious and needs further investigation.
>>>
>>> Tilman
>>>
>>>
>
>
Re: PDFBox regression tests?
Posted by Tilman Hausherr <TH...@t-online.de>.
I got a bit closer. IMHO it happens here:
    private static void setNotNull(Property property, String value,
                                   Metadata metadata) {
        if (metadata.get(property) == null && !StringUtils.isEmpty(value)) {
            metadata.set(property, value);
        }
    }
If "value" is not empty but consists only of spaces, then the problem happens.
The PDF has a buggy XMP so you get no title from DublinCore but some
title from Basic. However this "some title" from Basic is just spaces
(which may or may not be a bug) and shouldn't be used. If this is
skipped then we have the old behavior.
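A minimal sketch of what "skipping" a spaces-only value could look like. This
is not the actual Tika code or fix: a plain Map stands in for Tika's Metadata,
the property is a plain String key, and the names setNotBlank and isBlank are
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class WhitespaceTitleSketch {

    // True for null, empty, or whitespace-only values.
    // Note: trim() only strips characters <= U+0020, so other Unicode
    // spaces (e.g. U+00A0) would still count as content here.
    static boolean isBlank(String value) {
        return value == null || value.trim().isEmpty();
    }

    // Hypothetical stand-in for setNotNull: only set the property if it is
    // not already present AND the candidate value has real content.
    static void setNotBlank(Map<String, String> metadata, String property,
                            String value) {
        if (metadata.get(property) == null && !isBlank(value)) {
            metadata.put(property, value);
        }
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new HashMap<>();
        setNotBlank(metadata, "dc:title", "   ");        // skipped
        setNotBlank(metadata, "dc:title", "Real title"); // used
        System.out.println(metadata);
    }
}
```

With a check like this, a spaces-only title from one schema would not block a
later, meaningful title from another source.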
Tilman
Am 30.07.2020 um 18:05 schrieb Tilman Hausherr:
> Hi,
>
> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
>
> There's something with the XMP metadata extraction. dc:title: is empty
> (or an empty line and maybe spaces) in tika 1.25 but not in tika 1.24.
>
> I thought this could be related to some minor xmpbox changes but tika
> doesn't use it. So I searched and found some changes in
> PDMetadataExtractor.
>
> I'm not yet sure if that is the cause, although I played around with
> that one.
>
> If it is, then it is related to
>
> https://issues.apache.org/jira/browse/TIKA-3101
>
> Tilman
>
> Am 30.07.2020 um 12:43 schrieb Tim Allison:
>> Looks like there may be some issues with Japanese...don't know if
>> this is
>> related to your observation?
>>
>> It feels like when I sort by ascending order of
>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
>> pairs
>> in the "lost common tokens".
>>
>> Will look a bit more.
>>
>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
>> wrote:
>>
>>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>>> Reports are here:
>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>>>
>>>
>>> Thank you. Besides the exceptions, there are a few cases in content
>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>>> meaningful content, that is suspicious and needs further investigation.
>>>
>>> Tilman
>>>
>>>
>
>
Re: PDFBox regression tests?
Posted by Tilman Hausherr <TH...@t-online.de>.
Hi,
I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
Something is off with the XMP metadata extraction. dc:title is empty
(or an empty line and maybe spaces) in Tika 1.25 but not in Tika 1.24.
I thought this could be related to some minor xmpbox changes but tika
doesn't use it. So I searched and found some changes in PDMetadataExtractor.
I'm not yet sure if that is the cause, although I played around with
that one.
If it is, then it is related to
https://issues.apache.org/jira/browse/TIKA-3101
Tilman
Am 30.07.2020 um 12:43 schrieb Tim Allison:
> Looks like there may be some issues with Japanese...don't know if this is
> related to your observation?
>
> It feels like when I sort by ascending order of
> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
> in the "lost common tokens".
>
> Will look a bit more.
>
> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
> wrote:
>
>> Am 28.07.2020 um 23:51 schrieb Tim Allison:
>>> Reports are here:
>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>>
>> Thank you. Besides the exceptions, there are a few cases in content
>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
>> meaningful content, that is suspicious and needs further investigation.
>>
>> Tilman
>>
>>
Re: PDFBox regression tests?
Posted by Tim Allison <ta...@apache.org>.
Looks like there may be some issues with Japanese...don't know if this is
related to your observation?
It feels like when I sort by ascending order of
NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language pairs
in the "lost common tokens".
Will look a bit more.
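A rough sketch of the sort-and-filter described above, not the real report
tooling. The Row class and its field names are hypothetical, and it assumes
that a negative NUM_COMMON_TOKENS_DIFF_IN_B means B lost common tokens
relative to A.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LostTokensSketch {

    // Hypothetical in-memory view of one row of the comparison report.
    static final class Row {
        final String file;
        final String langA;
        final String langB;
        final int numCommonTokensDiffInB;

        Row(String file, String langA, String langB, int diff) {
            this.file = file;
            this.langA = langA;
            this.langB = langB;
            this.numCommonTokensDiffInB = diff;
        }
    }

    // Keep rows where both runs detected the given language and B lost
    // common tokens, ordered ascending so the biggest losses come first.
    static List<Row> biggestLosses(List<Row> rows, String lang) {
        List<Row> out = new ArrayList<>();
        for (Row r : rows) {
            if (r.langA.equals(lang) && r.langB.equals(lang)
                    && r.numCommonTokensDiffInB < 0) {
                out.add(r);
            }
        }
        out.sort(Comparator.comparingInt((Row r) -> r.numCommonTokensDiffInB));
        return out;
    }
}
```

Calling biggestLosses(rows, "jpn") would list the jpn->jpn files with the
largest drop in common tokens first.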
On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <TH...@t-online.de>
wrote:
> Am 28.07.2020 um 23:51 schrieb Tim Allison:
> > Reports are here:
> > https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
>
>
> Thank you. Besides the exceptions, there are a few cases in content
> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> meaningful content, that is suspicious and needs further investigation.
>
> Tilman
>
>
Re: PDFBox regression tests?
Posted by Tilman Hausherr <TH...@t-online.de>.
Am 28.07.2020 um 23:51 schrieb Tim Allison:
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
Thank you. Besides the exceptions, there are a few cases in content
extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
meaningful content. That is suspicious and needs further investigation.
Tilman