Posted to users@pdfbox.apache.org by Justinus Menzel <ju...@vyasa.com> on 2020/01/29 16:46:19 UTC

question about LegacyPDFStreamEngine

Hi,

I'm trying to convert PDF into XML and I'm using the PDFText2HTML class in
tools as inspiration. I noticed that PDFText2HTML extends PDFTextStripper,
which extends LegacyPDFStreamEngine. The comment section at the top of
LegacyPDFStreamEngine says something peculiar:
  * This class exists only so that we don't break the code of users who have their own subclasses
  * of PDFTextStripper. It replaces the good implementation of showGlyph in PDFStreamEngine, with
  * a bad implementation which is backwards compatible.

I looked at the comment of the "good implementation" of showGlyph in
PDFStreamEngine and it says there:
  * Called when a glyph is to be processed. This method is intended for overriding in subclasses,
  * the default implementation does nothing.

PDFStreamEngine.showGlyph is supposed to be the "good implementation", but
it does nothing and a subclass should override it. How does this make sense?
If LegacyPDFStreamEngine's showGlyph implementation is incorrect but
PDFTextStripper relies on that incorrect behavior, how does
PDFTextStripper compensate for it?
Does it even matter in PDFTextStripper's case, since this class just
extracts text and does not render anything?

Any clarification would be helpful.
Thank you.

-Justinus

Re: question about LegacyPDFStreamEngine

Posted by Tilman Hausherr <TH...@t-online.de>.
done (adjusted comments)

Tilman



---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@pdfbox.apache.org
For additional commands, e-mail: users-help@pdfbox.apache.org


Re: question about LegacyPDFStreamEngine

Posted by Tilman Hausherr <TH...@t-online.de>.
I had a look... the calculations were indeed in PDFStreamEngine, but in 
processEncodedText(). showGlyph() didn't exist previously, and ever since 
it was introduced it has been mostly empty.
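
To illustrate the "mostly empty method meant for overriding" idea, here is a minimal sketch of the pattern in plain Java. It does not use PDFBox; the classes SketchStreamEngine and SketchTextCollector, and the simplified showGlyph(int) signature, are hypothetical stand-ins and not the real API.

```java
// Hypothetical stand-in for a stream engine; NOT the PDFBox class.
class SketchStreamEngine {
    // Walks the character codes and calls the hook once per glyph.
    void processText(int[] codes) {
        for (int code : codes) {
            showGlyph(code);
        }
    }

    // Hook intended for overriding in subclasses; the default does nothing.
    protected void showGlyph(int code) {
    }
}

// Hypothetical subclass that gives the hook a useful body.
class SketchTextCollector extends SketchStreamEngine {
    private final StringBuilder text = new StringBuilder();

    @Override
    protected void showGlyph(int code) {
        // Treat each code as a Unicode code point (an assumption of this sketch).
        text.appendCodePoint(code);
    }

    String getText() {
        return text.toString();
    }
}

class HookDemo {
    public static void main(String[] args) {
        SketchTextCollector collector = new SketchTextCollector();
        collector.processText(new int[] {'P', 'D', 'F'});
        System.out.println(collector.getText()); // prints "PDF"
    }
}
```

In this sketch the base class only drives the loop, so an empty hook is not "good" or "bad" by itself; all the behavior lives in whichever subclass overrides it.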

I'll adjust the comment within a few days (wait for feedback here) to 
replace "good" with "mostly empty".

"Bad" means that the heights are a heuristic and not the real height. 
I'll adjust that text accordingly to something more neutral.

PDFTextStripper needs the "bad" implementation as it is now. The height 
is needed to decide whether glyphs are on the same line or not.
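
As a rough illustration of why an approximate height is enough for this purpose, the sketch below groups glyphs into lines by comparing baselines against a height-based tolerance. It is plain Java and is not PDFTextStripper's actual logic; the Glyph class, the LineGrouper class and the 0.5 tolerance factor are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; NOT the PDFTextStripper implementation.
class LineGrouper {
    // Hypothetical glyph: its text, baseline y, and an estimated (heuristic) height.
    static final class Glyph {
        final String text;
        final float y;
        final float height;

        Glyph(String text, float y, float height) {
            this.text = text;
            this.y = y;
            this.height = height;
        }
    }

    // Two glyphs count as being on the same line when their baselines differ
    // by less than half of the smaller height. The 0.5 factor is an assumption.
    static boolean sameLine(Glyph a, Glyph b) {
        float tolerance = 0.5f * Math.min(a.height, b.height);
        return Math.abs(a.y - b.y) < tolerance;
    }

    // Starts a new line whenever a glyph is not on the same line as the previous one.
    static List<String> groupIntoLines(List<Glyph> glyphs) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Glyph previous = null;
        for (Glyph g : glyphs) {
            if (previous != null && !sameLine(previous, g)) {
                lines.add(current.toString());
                current.setLength(0);
            }
            current.append(g.text);
            previous = g;
        }
        if (current.length() > 0) {
            lines.add(current.toString());
        }
        return lines;
    }
}
```

With glyphs A (y=100), B (y=101) and C (y=88), all of height 10, the sketch puts A and B on one line and C on the next. The height only has to be roughly right for that comparison to work, which is why a heuristic value can be acceptable for text extraction even though it would not be for rendering.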

The comment "This method was originally written by Ben Litchfield for 
PDFStreamEngine." is also wrong: "showGlyph" was introduced by Jon 
Hewson in 8/2014 and wasn't written by Ben Litchfield, but the 
calculations were.

Tilman





Re: question about LegacyPDFStreamEngine

Posted by Tilman Hausherr <TH...@t-online.de>.
On 29.01.2020 at 17:46, Justinus Menzel wrote:
> Hi,
>
> I'm trying to convert PDF into XML and I'm using PDFText2HTML class in
> tools as inspiration. I noticed that
> PDFText2HTML extends PDFTextStripper which extends LegacyPDFStreamEngine.
> The comment sections on top of LegacyPDFStreamEngine says something
> peculiar:
>    * This class exists only so that we don't break the code of users who have their own subclasses
>    * of PDFTextStripper. It replaces the good implementation of showGlyph in PDFStreamEngine, with
>    * a bad implementation which is backwards compatible.


Ouch. One of those "never trust the comment" events. I suspect that he 
meant the code in PageDrawer, or that some "good" code was in 
PDFStreamEngine at that time. I'll look into the history when I have 
more time.

Tilman



