Posted to users@pdfbox.apache.org by Joel Hirsh <jo...@gmail.com> on 2020/04/05 21:17:46 UTC

Re: Regression in 2.0.19

Thank you.

Are those changes likely to be a problem in the future, though?  I had
noticed that the changes did get slightly better results when reading PDFs
from OCR scans which had lots of extraneous text from handwriting on the
paper document.  So I assume there is a good reason for them.

On Tue, Mar 31, 2020 at 3:24 AM Andreas Lehmkuehler <an...@lehmi.de>
wrote:

> I've fixed the issue in the 2.0 branch; the trunk isn't affected. See
> PDFBOX-4805.
>
> @Joel: thanks for reporting and debugging the issue, especially as it was
> limited to some corner cases. Sorry for the inconvenience.
>
> Andreas
>
> On 30.03.20 at 11:03, Andreas Lehmkuehler wrote:
> > Looks like I accidentally committed some unrelated code :-(
> > I have to check that.
> >
> > On 30.03.20 at 10:56, Andreas Lehmkuehler wrote:
> >> Thanks for the debugging. Those changes were made in PDFBOX-4760, which
> >> should help us find the issue.
> >>
> >> Andreas
> >>
> >> On 30.03.20 at 06:38, Joel Hirsh wrote:
> >>> I did try to create a test case by taking out most of the text on a page,
> >>> but that also fixed the problem.
> >>>
> >>> I did verify that neither of the changes to PDTrueTypeFont for PDFBOX-4755
> >>> / PDF.js #5501 is coming into play.
> >>> I set a breakpoint at those lines, and it was never hit. Also, one file that
> >>> has trouble uses a PDType0Font called 'fon2';
> >>> another uses a PDTrueTypeFont.
> >>>
> >>> I just started counting bad Unicode characters for other reasons, by
> >>> overriding PDFTextProcessor.showText().
> >>> I put in a change to test the return from font.toUnicode(code) to see
> >>> if it is null, and just count them. And there are no nulls coming back.
> >>> But the text breakup occurs with or without my override.
> >>>
> >>> So I did compare, and there are not a whole lot of other changes from 2.0.18
> >>> to 2.0.19. It turns out that if I
> >>> revert to the old version of PDFTextStripper.overlap() (two lines of code),
> >>> the problem goes away.
> >>> What were those changes supposed to address?
> >>>
> >>> Regards
> >>>
> >>> On Wed, Mar 4, 2020 at 8:34 PM Tilman Hausherr <TH...@t-online.de>
> >>> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Please try to submit a test case.
> >>>>
> >>>> My guess is that this is related to bad /ToUnicode streams.
> >>>>
> >>>> Tilman
> >>>>
> >>>> On 05.03.2020 at 03:09, Joel Hirsh wrote:
> >>>>> I just started testing with version 2.0.19.
> >>>>>
> >>>>> I am using PDFTextStripper, and some files that gave back fine results in
> >>>>> 2.0.18 are completely useless with 2.0.19.  As an example, I have one file
> >>>>> that gets about 600 phrases in 2.0.18.  In 2.0.19 it gets over 16,000
> >>>>> phrases, the majority of which are zero-length strings, and most of the
> >>>>> rest are single characters making up the phrase, rather than a phrase.
> >>>>> The file is confidential, so I cannot just post it.
> >>>>>
> >>>>> Am I telling you something that you already know about, or should I try to
> >>>>> submit a test case? Or is there some new option I am unaware of?
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
> >>>> For additional commands, e-mail: users-help@pdfbox.apache.org

Re: Regression in 2.0.19

Posted by Andreas Lehmkuehler <an...@lehmi.de>.

On 05.04.20 at 23:17, Joel Hirsh wrote:
> Thank you.
> 
> Are those changes likely to be a problem in the future, though?  I had
> noticed that the changes did get slightly better results when reading PDFs
> from OCR scans which had lots of extraneous text from handwriting on the
> paper document.  So I assume there is a good reason for them.
Some time ago there was a proposal to change that part of the text extraction to
get better results for some corner cases (once I find the related thread, I'll
post a pointer to it). I experimented with some changes and ended up
with the ones I accidentally committed. They had no effect on many cases and
worked well for the given corner case, but obviously the other side of the coin
was the current regression.

We are all aware that we have to overhaul the whole text extraction stuff. IMHO
it doesn't make much sense to put too much effort into changes with a small
effect but a huge potential to introduce a regression.

Andreas
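
For readers who have not looked at this part of the text extraction: the thread centers on PDFTextStripper.overlap(), a short check that feeds into the decision whether two pieces of text belong to the same line. The following is a minimal sketch of a check of that kind, assuming it compares the vertical position and height of two glyphs. The class name, the method signatures, and the 0.1 tolerance are illustrative assumptions, not a copy of the PDFBox source, and the sketch does not show what changed between 2.0.18 and 2.0.19.

```java
// Illustrative sketch only: NOT the PDFBox source.
// Class name, signatures, and the 0.1 tolerance are assumptions.
final class LineOverlapSketch {

    // True if 'second' lies strictly within 'variance' of 'first'.
    static boolean within(float first, float second, float variance) {
        return second < first + variance && second > first - variance;
    }

    // True if two glyphs plausibly sit on the same text line.
    // y is the glyph's vertical position and height its height, so the
    // glyph is treated as covering the band [y - height, y]. The glyphs
    // overlap if their positions are nearly equal, or if the position
    // of one falls inside the band of the other.
    static boolean overlap(float y1, float height1, float y2, float height2) {
        return within(y1, y2, 0.1f)
                || (y2 <= y1 && y2 >= y1 - height1)
                || (y1 <= y2 && y1 >= y2 - height2);
    }

    public static void main(String[] args) {
        // Nearly identical positions: same line.
        System.out.println(overlap(100f, 10f, 100.05f, 10f)); // prints true
        // Slightly shifted, but the first position is inside the second band.
        System.out.println(overlap(100f, 10f, 104f, 10f));    // prints true
        // Far apart: different lines.
        System.out.println(overlap(100f, 10f, 130f, 10f));    // prints false
    }
}
```

If a check like this rejects glyphs that do share a line, each rejected glyph would start a new line. That would be consistent with the many single-character and empty phrases reported above, although the thread itself does not show the actual change.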
