
Posted to dev@pdfbox.apache.org by Tim Allison <ta...@apache.org> on 2020/08/12 21:21:18 UTC

Re: PDFBox regression tests?

All,
  Apologies for my delay...

Reports are here:
https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz

I haven't had a chance to look at the reports yet. :(

I tried to update the instructions for running the process on the VM.
Please let me know if you have any questions or if I need to make
improvements.

Thank you.

   Best,

              Tim

On Fri, Jul 31, 2020 at 9:55 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> On 31.07.20 at 08:27, Tilman Hausherr wrote:
> > On 31.07.2020 at 08:05, Andreas Lehmkuehler wrote:
> >> On 30.07.20 at 20:22, Tilman Hausherr wrote:
> >>> I've looked at all the files I had highlighted yesterday. All differences
> >>> except two are related to the metadata problem.
> >>>
> >>> The other two have a problem with spaces, i.e. glyphs not being near
> >>> each other.
> >>>
> >>> commoncrawl3_refetched/PK/PKI7PHOBREMT7QBJIL5KZNVE2EVAFLBQ
> >>> commoncrawl3_refetched/WZ/WZTGH3ACJPLKXD5JBAH2AONGOMYNYDUJ
> >>>
> >>> This doesn't have to be a bug; I've seen many files where the extraction
> >>> is better, so whatever change there is may have improved more things.
> >> Thanks for the analysis. IMHO we are good to cut a new release, aren't we?
> >
> >
> > Yeah we could.
> >
> > But if the bug gets solved it would be nice to have a new diff output to
> > see if anything else gets shown more clearly.
> I forgot to mention that the bug from PDFBOX-4927 is fixed. Is there anything
> else we have to wait for before we run the tests again, maybe some tika fix?
>
> Andreas
>
> > Tilman
> >
> >
> >
> >>
> >>
> >>>
> >>> Tilman
> >>>
> >>> On 30.07.2020 at 18:05, Tilman Hausherr wrote:
> >>>> Hi,
> >>>>
> >>>> I started with 3AXKIR2SX6TIFMLGSRTLOM2SDOPLBJ5P.pdf
> >>>>
> >>>> There's something wrong with the XMP metadata extraction. dc:title is
> >>>> empty (or an empty line and maybe spaces) in tika 1.25 but not in
> >>>> tika 1.24.
> >>>>
> >>>> I thought this could be related to some minor xmpbox changes but tika
> >>>> doesn't use it. So I searched and found some changes in
> >>>> PDMetadataExtractor.
> >>>>
> >>>> I'm not yet sure if that is the cause, although I played around with
> >>>> that one.
> >>>>
> >>>> If it is, then it is related to
> >>>>
> >>>> https://issues.apache.org/jira/browse/TIKA-3101
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 30.07.2020 at 12:43, Tim Allison wrote:
> >>>>> Looks like there may be some issues with Japanese...don't know if this
> >>>>> is related to your observation?
> >>>>>
> >>>>> It feels like when I sort by ascending order of
> >>>>> NUM_COMMON_TOKENS_DIFF_IN_B, there are quite a few jpn->jpn language
> >>>>> pairs in the "lost common tokens".
> >>>>>
> >>>>> Will look a bit more.
> >>>>>
> >>>>> On Thu, Jul 30, 2020 at 2:26 AM Tilman Hausherr <THausherr@t-online.de>
> >>>>> wrote:
> >>>>>
> >>>>>> On 28.07.2020 at 23:51, Tim Allison wrote:
> >>>>>>> Reports are here:
> >>>>>>>
> >>>>>>> https://corpora.tika.apache.org/base/reports/pdfbox-2.0.21-SNAPSHOT.tgz
> >>>>>>
> >>>>>> Thank you. Besides the exceptions, there are a few cases in content
> >>>>>> extraction where "TOP_10_MORE_IN_B" is empty and "TOP_10_MORE_IN_A" has
> >>>>>> meaningful content; that is suspicious and needs further investigation.
> >>>>>>
> >>>>>> Tilman
> >>>>>>
> >>>>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@pdfbox.apache.org
> >>>> For additional commands, e-mail: dev-help@pdfbox.apache.org
>
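
For anyone who wants to repeat the two checks described in the quoted thread (rows where "TOP_10_MORE_IN_A" has content but "TOP_10_MORE_IN_B" is empty, and rows sorted ascending by NUM_COMMON_TOKENS_DIFF_IN_B for jpn->jpn pairs), here is a minimal Python sketch. It assumes the comparison sheet has been exported to CSV. FILE_PATH, LANG_A and LANG_B are hypothetical column names, and the sample rows are made up for illustration; the real reports may be laid out differently.

```python
import csv
import io

# Made-up sample standing in for a CSV export of one comparison sheet.
# FILE_PATH, LANG_A and LANG_B are assumed column names, not confirmed ones.
SAMPLE = """\
FILE_PATH,LANG_A,LANG_B,NUM_COMMON_TOKENS_DIFF_IN_B,TOP_10_MORE_IN_A,TOP_10_MORE_IN_B
a.pdf,jpn,jpn,-120,foo: 12 | bar: 9,
b.pdf,eng,eng,15,,the: 4
c.pdf,jpn,jpn,-40,baz: 3,qux: 2
d.pdf,eng,eng,0,title: 1,
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def lost_content(rows):
    """Rows where A has extra tokens but B has none (the suspicious case)."""
    return [
        r for r in rows
        if r["TOP_10_MORE_IN_A"].strip() and not r["TOP_10_MORE_IN_B"].strip()
    ]


def biggest_losses(rows, lang=None):
    """Rows sorted ascending by NUM_COMMON_TOKENS_DIFF_IN_B.

    If lang is given, keep only rows where both sides were detected
    as that language (for example jpn->jpn).
    """
    if lang is not None:
        rows = [r for r in rows if r["LANG_A"] == lang and r["LANG_B"] == lang]
    return sorted(rows, key=lambda r: int(r["NUM_COMMON_TOKENS_DIFF_IN_B"]))


rows = load_rows(SAMPLE)
print([r["FILE_PATH"] for r in lost_content(rows)])
print([r["FILE_PATH"] for r in biggest_losses(rows, lang="jpn")])
```

To run this against a real export, replace SAMPLE with the file contents and adjust the column names to whatever the report actually uses.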

Re: PDFBox regression tests?

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 12.08.20 at 23:21, Tim Allison wrote:
> All,
>    Apologies for my delay...
> 
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
Thanks!

> I haven't had a chance to look at the reports yet. :(
I had a look at the new exceptions and both are already in 2.0.20. I can't see 
any showstopper here.

Andreas



Re: PDFBox regression tests?

Posted by Tilman Hausherr <TH...@t-online.de>.

On 12.08.2020 at 23:21, Tim Allison wrote:
> All,
>    Apologies for my delay...
>
> Reports are here:
> https://corpora.tika.apache.org/base/reports/pdfbox-2_0_21-20200810-reports.tgz
>
> I haven't had a chance to look at the reports yet. :(
>
> I tried to update the instructions for running the process on the vm.
> Please let me know if you have any questions, or if I need to make
> improvements.


Thanks. It looks good that there are no showstoppers, so thumbs up
from me.

But the tika problem is still there, although in fewer files. It is gone
for the file I mentioned (or the file wasn't in the test), but not for
others, e.g.

commoncrawl3/6V/6VTB5IUKXBFA3JZPJBUPVSRY7L56K6LE

commoncrawl3/5I/5I6STZEO5W25GPETYGLCDLIB6OKQXCIG

Tilman




