Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/09/02 07:47:32 UTC

[jira] Resolved: (PDFBOX-302) Improve font handling (was: layout print problem)

     [ https://issues.apache.org/jira/browse/PDFBOX-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-302.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 0.8.0-incubator

AFAIK there aren't any issues with this improvement, so I'll set this to resolved.

For now there aren't any mappings missing. If we find some later, it'll be no problem to add them.

> Improve font handling (was: layout print problem)
> -------------------------------------------------
>
>                 Key: PDFBOX-302
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-302
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDFReader
>            Reporter: Jukka Zitting
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 0.8.0-incubator
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1787501
> Originally submitted by gjniewenhuijse on 2007-09-04 00:24.
> When I print the attached file, some things are not printed well:
> - The gray box at the top
> - The fonts are printed bold, and that's not right.
> Is there any solution for now, or for later?
> When I open and print this file with Adobe Reader, everything is fine, but with PDFBox I get a layout problem.
> I used the newest PDFBox version (also tested the nightly build)
> [attachment on SourceForge]
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1787501&file_id=244104
> orarrp.pdf (application/pdf), 7871 bytes
> pdf with print problem

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Resolved: (PDFBOX-302) Improve font handling

Posted by Tony Scerri <to...@gmail.com>.
I will look at generating appropriate patches for the two separate changes I
mentioned today. I have noticed one minor issue with text extraction after
the char and word spacing fix: an extra space is added in one word in one of
the three PDFs I have been working with.

I have also made a third change relating to the identification of fonts
embedded in a PDF. The contained TTF could not be extracted because it failed
to load properly when AWT was called: the "name" table was missing, which I
assume indicates an invalid/corrupt PDF (but not knowing too much about TTF
etc. I'm not 100% sure). I may need to build a specific sample PDF to submit,
as the contents of the PDFs I'm working with cannot be circulated.
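A rough way to confirm that an embedded TrueType program really lacks its "name" table is to read the sfnt table directory directly. The sketch below assumes the font bytes have already been pulled out of the PDF; the class and method names (SfntTables, tableTags, hasNameTable) are made up for illustration and are not part of PDFBox. It relies only on the standard sfnt layout: a 12-byte header with the table count at offset 4, followed by 16-byte records that each start with a 4-byte tag.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Illustrative helper, not PDFBox code: lists the table tags of an sfnt (TrueType) font.
class SfntTables {

    // Returns the 4-character tags found in the table directory, in file order.
    static Set<String> tableTags(byte[] fontData) {
        if (fontData.length < 12) {
            throw new IllegalArgumentException("too short for an sfnt header");
        }
        // ByteBuffer is big-endian by default, which matches the sfnt format.
        int numTables = ByteBuffer.wrap(fontData).getShort(4) & 0xFFFF;
        Set<String> tags = new LinkedHashSet<>();
        for (int i = 0; i < numTables; i++) {
            int record = 12 + 16 * i; // each directory record is 16 bytes
            if (record + 16 > fontData.length) {
                throw new IllegalArgumentException("truncated table directory");
            }
            tags.add(new String(fontData, record, 4, StandardCharsets.US_ASCII));
        }
        return tags;
    }

    // True if the directory lists a "name" table.
    static boolean hasNameTable(byte[] fontData) {
        return tableTags(fontData).contains("name");
    }
}
```

If hasNameTable returns false for the extracted stream, the font program itself is incomplete, which would be consistent with the AWT loading failure described above.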

Tony.

2009/9/7 Andreas Lehmkühler <an...@lehmi.de>

> Hi Tony,
>
> is it possible to provide us with a sample document to test your patch?
> As attachments aren't allowed on the list, you have to create a new
> issue on JIRA and attach your sample.
>
> Thanks in advance,
> Andreas Lehmkühler

Re: [jira] Resolved: (PDFBOX-302) Improve font handling

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Tony,

is it possible to provide us with a sample document to test your patch?
As attachments aren't allowed on the list, you have to create a new
issue on JIRA and attach your sample.

Thanks in advance,
Andreas Lehmkühler


Re: [jira] Resolved: (PDFBOX-302) Improve font handling

Posted by Andreas Lehmkühler <an...@lehmi.de>.
Hi Tony,

first of all, thanks for investigating this subject. Please attach
your patch to PDFBOX-508 if possible so that we are able to compare
Dmitry's solution and yours. Perhaps a combination of both will solve all
parts of that issue.

Thanks in advance,

Andreas Lehmkühler


Re: [jira] Resolved: (PDFBOX-302) Improve font handling (was: layout print problem)

Posted by Tony Scerri <to...@gmail.com>.
Following on from this, I now have the character spacing and word spacing
being done in image writing, and the output looks almost identical to the PDF
viewed in Adobe Reader (with respect to text rendering, including layout). It
was a bit of a desperate approach, but it shows the results can be achieved.
It appears to be a similar fix to that suggested in Jira PDFBOX-508, but I
only needed to modify the PDFStreamEngine.java class. I changed the
processEncodedText method to simply process the text position of each
character found in the stream.

The only undesirable consequence would be performance, as this will
trigger one callback to processTextPosition for each character rather than
one for a whole sequence. But given that this appears to be the only reliable
way to establish where each character should be placed, I'm not sure what the
alternative would be.
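To make the per-character idea concrete, here is a minimal sketch of how each character's start position can be derived from the PDF text state. It is not the actual change to processEncodedText: the GlyphSink interface is a hypothetical stand-in for a per-character callback such as processTextPosition, and the widths array is assumed to hold glyph widths in 1/1000 text-space units indexed by single-byte character code. The advance per character follows the PDF text-positioning rule (glyph width times font size, plus character spacing, plus word spacing for code 32, all times horizontal scaling).

```java
// Illustrative sketch, not PDFBox code: one callback per character with its x offset.
class PerCharacterPositions {

    // Hypothetical stand-in for a per-character callback such as processTextPosition.
    interface GlyphSink {
        void glyph(int code, double x);
    }

    // Reports the start x of every character code, relative to the start of the string.
    static void showText(int[] codes, double[] widths1000, double fontSize,
                         double charSpacing, double wordSpacing,
                         double horizontalScaling, GlyphSink sink) {
        double x = 0;
        for (int code : codes) {
            sink.glyph(code, x);                          // one callback per character
            double w0 = widths1000[code] / 1000.0;        // glyph width in text space
            double tw = (code == 32) ? wordSpacing : 0.0; // word spacing only after a space
            x += (w0 * fontSize + charSpacing + tw) * horizontalScaling;
        }
    }
}
```

With this shape the cost is one callback per character, which is the performance trade-off mentioned above; a caller that wants fewer callbacks would have to merge runs where both spacing values are zero.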

Like I said, I didn't modify anything else to get this going, and text
extraction wasn't affected when sorting by position for horizontal text. For
diagonal text going up from bottom left to top right things changed, but the
original wasn't perfect and it came from text pieces in an embedded image
(EPS). What I got out after the change was the text being read from bottom to
top, going left to right, so a vertical read, and the characters came out in
the right order by position in that orientation. That would be a different
problem to solve.

On Mon, Sep 7, 2009 at 4:40 PM, Tony Scerri <to...@gmail.com> wrote:

> Not sure if this is a possible cause for issues others have reported. I
> found that when creating images from PDFs I was getting a lot of jumbled
> text, bits overlapping others etc., and generally it looked wrong. It turns
> out, after much digging and tinkering, that the FontManager was returning the
> wrong font even for standard fonts available in most environments.
>
> The fix I put in was inside the iteration over the available AWT fonts in
> the loadFonts method of FontManager. As the last line of the for loop I
> added:
>
>             envFonts.put(normalizeFontname(font.getPSName()),font);
>
> This puts in the PostScript name, which is quite often used inside PDFs from
> what I have been seeing lately in my work. This now has a much better chance
> of looking up the correct font. I no longer get overlapped words etc. because
> the font's metrics are a much better match for what was expected.
>
> I think this problem may be more prevalent in PDFs where the text has been
> fully justified. I have run into a subsequent issue that I am still plodding
> my way through: I'm now left with large gaps in lines in the middle of words
> because PDFBox isn't rendering the word spacing correctly (it might also be
> character spacing). This is all down to the use of AWT rendering of fonts,
> which as far as I can tell won't allow for the kind of control required when
> rendering a whole string. The alternative seems to be to render each
> character one by one with the appropriate displacement between each glyph.
>
> Tony

Re: [jira] Resolved: (PDFBOX-302) Improve font handling (was: layout print problem)

Posted by Tony Scerri <to...@gmail.com>.
Not sure if this is a possible cause for issues others have reported. I
found that when creating images from PDFs I was getting a lot of jumbled
text, bits overlapping others etc., and generally it looked wrong. It turns
out, after much digging and tinkering, that the FontManager was returning the
wrong font even for standard fonts available in most environments.

The fix I put in was inside the iteration over the available AWT fonts in
the loadFonts method of FontManager. As the last line of the for loop I added:

            envFonts.put(normalizeFontname(font.getPSName()),font);

This puts in the PostScript name, which is quite often used inside PDFs from
what I have been seeing lately in my work. This now has a much better chance
of looking up the correct font. I no longer get overlapped words etc. because
the font's metrics are a much better match for what was expected.
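For context, the idea can be sketched outside PDFBox as follows. This is not the real FontManager: the class EnvFontIndex and its normalize method are hypothetical, and the normalization rule (lower-case, keep only ASCII letters and digits) is an assumption rather than what normalizeFontname actually does. The AWT calls used (getAllFonts, getFamily, getFontName, getPSName) are standard.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; this is not PDFBox's FontManager.
class EnvFontIndex {

    // Assumed normalization: lower-case and keep only ASCII letters and digits.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
    }

    // Registers one font under its family name, full font name and PostScript name.
    static void register(Map<String, Font> index, Font font) {
        index.putIfAbsent(normalize(font.getFamily()), font); // first face seen wins for the family
        index.put(normalize(font.getFontName()), font);
        index.put(normalize(font.getPSName()), font);         // the extra key added by the fix
    }

    // Indexes every font the local graphics environment reports.
    static Map<String, Font> loadEnvironmentFonts() {
        Map<String, Font> index = new HashMap<>();
        for (Font font : GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts()) {
            register(index, font);
        }
        return index;
    }

    // Looks up a font by the name found in the PDF; returns null if nothing matches.
    static Font lookup(Map<String, Font> index, String pdfFontName) {
        return index.get(normalize(pdfFontName));
    }
}
```

A PDF that refers to a font as "Arial-BoldMT" would then resolve through the PostScript-name key even if the family and full names on the system are spelled differently.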

I think this problem may be more prevalent in PDFs where the text has been
fully justified. I have run into a subsequent issue that I am still plodding
my way through: I'm now left with large gaps in lines in the middle of words
because PDFBox isn't rendering the word spacing correctly (it might also be
character spacing). This is all down to the use of AWT rendering of fonts,
which as far as I can tell won't allow for the kind of control required when
rendering a whole string. The alternative seems to be to render each
character one by one with the appropriate displacement between each glyph.
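One possible shape for per-glyph displacement in AWT, offered only as a sketch, is to build a GlyphVector for the string and shift each glyph before drawing. The class SpacedTextPainter and its method names are made up for illustration; the spacing values are assumed to be already converted to device units, and the sketch relies on createGlyphVector mapping characters to glyphs one-to-one. It still uses the AWT font's own advances, so it only helps when the substituted font's metrics are close to the PDF's.

```java
import java.awt.Font;
import java.awt.Graphics2D;
import java.awt.font.GlyphVector;
import java.awt.geom.Point2D;

// Illustrative sketch, not PDFBox code: draws a string with extra per-glyph spacing.
class SpacedTextPainter {

    // Extra x shift for each glyph: character spacing accumulates after every
    // character, word spacing additionally after each space.
    static double[] extraShifts(String text, double charSpacing, double wordSpacing) {
        double[] shifts = new double[text.length()];
        double extra = 0;
        for (int i = 0; i < text.length(); i++) {
            shifts[i] = extra;
            extra += charSpacing;
            if (text.charAt(i) == ' ') {
                extra += wordSpacing;
            }
        }
        return shifts;
    }

    // Draws the text at (x, y) with every glyph moved right by its accumulated shift.
    static void draw(Graphics2D g, Font font, String text, float x, float y,
                     double charSpacing, double wordSpacing) {
        GlyphVector gv = font.createGlyphVector(g.getFontRenderContext(), text);
        double[] shifts = extraShifts(text, charSpacing, wordSpacing);
        int n = Math.min(gv.getNumGlyphs(), shifts.length);
        for (int i = 0; i < n; i++) {
            Point2D p = gv.getGlyphPosition(i);
            gv.setGlyphPosition(i, new Point2D.Double(p.getX() + shifts[i], p.getY()));
        }
        g.drawGlyphVector(gv, x, y);
    }
}
```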

Tony
