Posted to dev@pdfbox.apache.org by Andreas Lehmkühler <an...@lehmi.de> on 2009/09/07 22:27:51 UTC

Re: [jira] Resolved: (PDFBOX-302) Improve font handling

Hi Tony,

First of all, thanks for investigating this subject. Please attach
your patch to PDFBOX-508 if possible so that we can compare
Dmitry's solution and yours. Perhaps a combination of both will solve all
parts of that issue.

Thanks in advance,

Andreas Lehmkühler

Tony Scerri wrote:
> Following on from this, I now have the character spacing and word spacing
> applied when writing images, and the output looks almost identical to the PDF
> viewed in Adobe Reader (with respect to text rendering, including layout). It
> was a bit of a desperate approach, but it shows the results can be achieved. It
> appears to be a similar fix to that suggested in Jira PDFBOX-508, but I only
> needed to modify the PDFStreamEngine.java class. I changed the
> processEncodedText method to simply process the text position of each
> character found in the stream.
> 
> The only undesirable consequence would be performance, as this will
> trigger one callback to processTextPosition for each character rather than
> one per sequence. Given that this appears to be the only reliable way to
> establish where each character should be placed, I'm not sure what the
> alternative would be.
> 
> Like I said, I didn't modify anything else to get this going, and text
> extraction wasn't affected when sorting by position for horizontal text. For
> diagonal text going up from bottom left to top right things changed, but the
> original wasn't perfect and it came from text pieces in an embedded image
> (EPS). What I got out after the change was the text being read from bottom to
> top, going left to right, so a vertical read, and the characters came out in
> the right order by position in that orientation, so that would be a different
> problem to solve.
> 
> On Mon, Sep 7, 2009 at 4:40 PM, Tony Scerri <to...@gmail.com> wrote:
> 
>> Not sure if this is a possible cause for issues others have reported. I
>> found that when creating images from PDFs I was getting a lot of jumbled
>> text, bits overlapping others etc., and generally it looked wrong. It turns
>> out, after much digging and tinkering, that the FontManager was returning the
>> wrong font even for standard fonts available in most environments.
>>
>> The fix I put in was in the loop over the available AWT fonts inside the
>> loadFonts method of FontManager. As the last line of the for loop I
>> added:
>>
>>             envFonts.put(normalizeFontname(font.getPSName()),font);
>>
>> This puts in the PostScript name, which is quite often used inside PDFs
>> from what I have been seeing lately in my work. The lookup now has a much
>> better chance of finding the correct font. I no longer get overlapped words
>> etc., because the font's metrics are much closer to what was expected.
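
The lookup idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual FontManager code: the class FontIndexSketch and the methods normalize and index are hypothetical names, and normalize is only an assumed stand-in for normalizeFontname. The java.awt.Font methods getFamily, getFontName, and getPSName are the real AWT calls.

```java
import java.awt.Font;
import java.awt.GraphicsEnvironment;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class FontIndexSketch {

    // Hypothetical stand-in for normalizeFontname: lower-case the name
    // and drop whitespace, commas, and hyphens.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT).replaceAll("[\\s,-]", "");
    }

    // Index every font under its family name, its face name, and its
    // PostScript name, so a PDF that refers to a font by its PostScript
    // name can still be matched.
    static Map<String, Font> index(Font[] fonts) {
        Map<String, Font> byName = new HashMap<>();
        for (Font font : fonts) {
            byName.put(normalize(font.getFamily()), font);
            byName.put(normalize(font.getFontName()), font);
            byName.put(normalize(font.getPSName()), font);
        }
        return byName;
    }

    public static void main(String[] args) {
        Font[] fonts = GraphicsEnvironment.getLocalGraphicsEnvironment().getAllFonts();
        Map<String, Font> byName = index(fonts);
        // "Times-Roman" is just an example of a name a PDF might use.
        Font match = byName.get(normalize("Times-Roman"));
        System.out.println(match == null ? "no match" : match.getPSName());
    }
}
```

Whether a given name resolves depends on the fonts installed in the local environment, so the lookup can still return no match.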
>>
>> I think this problem may be more prevalent in PDFs where the text has been
>> fully justified. I have run into a subsequent issue that I am still plodding
>> my way through: I'm now left with large gaps in lines in the middle
>> of words because PDFBox isn't rendering the word spacing correctly (it might
>> also be the character spacing). This is all down to the use of AWT rendering
>> of fonts, which as far as I can tell won't allow for the kind of control
>> required when rendering a whole string. The alternative seems to be to
>> render each character one by one with the appropriate displacement
>> between each glyph.
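
A rough sketch of that per-glyph approach, under stated assumptions: GlyphSpacingSketch, glyphOrigins, and draw are hypothetical names, and the glyph widths are taken from AWT FontMetrics purely for illustration, whereas a PDF renderer would take them from the PDF font's own width data. The advance used here is glyph width plus character spacing, plus word spacing after a space character.

```java
import java.awt.Font;
import java.awt.FontMetrics;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

class GlyphSpacingSketch {

    // Returns the x origin of every glyph. Each advance is the glyph width
    // plus the character spacing, plus the word spacing after a space.
    static double[] glyphOrigins(String text, double[] widths,
                                 double charSpacing, double wordSpacing) {
        double[] origins = new double[text.length()];
        double x = 0;
        for (int i = 0; i < text.length(); i++) {
            origins[i] = x;
            x += widths[i] + charSpacing;
            if (text.charAt(i) == ' ') {
                x += wordSpacing;
            }
        }
        return origins;
    }

    // Draws the text one character at a time at the computed origins,
    // instead of handing the whole string to a single drawString call.
    static void draw(Graphics2D g, String text, float startX, float baseline,
                     double charSpacing, double wordSpacing) {
        FontMetrics metrics = g.getFontMetrics();
        double[] widths = new double[text.length()];
        for (int i = 0; i < text.length(); i++) {
            widths[i] = metrics.charWidth(text.charAt(i)); // assumption: AWT widths
        }
        double[] origins = glyphOrigins(text, widths, charSpacing, wordSpacing);
        for (int i = 0; i < text.length(); i++) {
            g.drawString(String.valueOf(text.charAt(i)),
                         (float) (startX + origins[i]), baseline);
        }
    }

    public static void main(String[] args) {
        BufferedImage image = new BufferedImage(400, 60, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.setFont(new Font(Font.SERIF, Font.PLAIN, 18));
        draw(g, "fully justified", 10f, 40f, 0.5, 4.0);
        g.dispose();
    }
}
```

Keeping the position arithmetic in glyphOrigins separate from the drawing makes it possible to check the spacing without a graphics context.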
>>
>> Tony
>>
>>
>> On Wed, Sep 2, 2009 at 6:47 AM, Andreas Lehmkühler (JIRA) <jira@apache.org> wrote:
>>>     [ https://issues.apache.org/jira/browse/PDFBOX-302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>>>
>>> Andreas Lehmkühler resolved PDFBOX-302.
>>> ---------------------------------------
>>>
>>>       Resolution: Fixed
>>>    Fix Version/s: 0.8.0-incubator
>>>
>>> AFAIK there aren't any issues with this improvement, so I'll set this
>>> to resolved.
>>>
>>> For now there aren't any mappings missing. If we find some later, it'll be
>>> no problem to add them.
>>>
>>>> Improve font handling (was: layout print problem)
>>>> -------------------------------------------------
>>>>
>>>>                 Key: PDFBOX-302
>>>>                 URL: https://issues.apache.org/jira/browse/PDFBOX-302
>>>>             Project: PDFBox
>>>>          Issue Type: Improvement
>>>>          Components: PDFReader
>>>>            Reporter: Jukka Zitting
>>>>            Assignee: Andreas Lehmkühler
>>>>            Priority: Minor
>>>>             Fix For: 0.8.0-incubator
>>>>
>>>>
>>>> [imported from SourceForge]
>>>> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552832&aid=1787501
>>>> Originally submitted by gjniewenhuijse on 2007-09-04 00:24.
>>>> When I print the attached file, some things are not printed well:
>>>> - The gray box at the top
>>>> - The fonts are printed bold, and that's not right.
>>>> Is there any solution for now, or for later?
>>>> When I open and print this file with Adobe Reader, everything is fine,
>>>> but with PDFBox I get a layout problem.
>>>> I used the newest PDFBox version (also tested the nightly build).
>>>> [attachment on SourceForge]
>>>> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&aid=1787501&file_id=244104
>>>> orarrp.pdf (application/pdf), 7871 bytes
>>>> pdf with print problem