Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org> on 2012/01/22 14:26:41 UTC

[jira] [Resolved] (PDFBOX-610) Fonts should not be cached by PDFStreamEngine

     [ https://issues.apache.org/jira/browse/PDFBOX-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-610.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

I have to agree with Peter. To cache a font one has to ensure that fonts are 100% equal, which is possible but complicated. It's not enough to just compare the name, the subtype and the encoding. I stumbled upon this issue when rendering the Centerplan.pdf attached to PDFBOX-615.

I removed the font caching in revision 1234506. I also improved, and hopefully simplified, the handling of a PDF's resources. On one hand these changes may have a negative impact on performance because of the missing font cache, but on the other hand all fonts are handled correctly now.
                
> Fonts should not be cached by PDFStreamEngine
> ---------------------------------------------
>
>                 Key: PDFBOX-610
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-610
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win or Linux
>            Reporter: Peter Costello
>            Assignee: Andreas Lehmkühler
>              Labels: PDFStreamEngine, fontwidth
>             Fix For: 1.7.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.pdfbox.util.PDFStreamEngine
>    Fonts are cached using variable 'private Map documentFontCache = new HashMap();'
>    which is used in method 'processSubStream()' and the call 'sr.fonts = resources.getFonts(documentFontCache);'
> The problem is that PDF documents can store a limited range of 'firstChar' and 'lastChar' (maybe just a space char), and then expand that range at a later point within the same page. When the font is cached, those updates are ignored.
> In particular, test page 1 of 'http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf'.
> Using font caching, the widths of the characters in the upper right corner of the page are reported as zero, and both text extraction and text merging are compromised.
> Without font caching, the widths are correct. There are other examples that cause the same problem.
> To fix the problem change the call in method 'processSubStream()' to:
>              sr.fonts = resources.getFonts(null);
> There was some effort put into font caching.  Unfortunately, it should not be used on unknown documents.
>  
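
To illustrate the failure mode described above, here is a minimal sketch in plain Java. It does not use the PDFBox API: the FontDict class, the lookup() helper and the width values are hypothetical stand-ins, and the cache is assumed to be keyed by font name only. It shows how a cached font with a narrow firstChar/lastChar range hides a later definition with a wider range, so widths come back as zero.

```java
import java.util.HashMap;
import java.util.Map;

public class FontCacheSketch {

    /** Hypothetical stand-in for a PDF font dictionary; not a PDFBox class. */
    static final class FontDict {
        final String baseName;
        final int firstChar;
        final int lastChar;
        final float[] widths; // widths[i] belongs to character code firstChar + i

        FontDict(String baseName, int firstChar, int lastChar, float[] widths) {
            this.baseName = baseName;
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }

        /** Codes outside [firstChar, lastChar] have no width entry, so report 0. */
        float widthOf(int code) {
            if (code < firstChar || code > lastChar) {
                return 0f;
            }
            return widths[code - firstChar];
        }
    }

    /**
     * Assumed cache policy: key on the font name only.
     * Passing a null cache disables caching, mirroring getFonts(null).
     */
    static FontDict lookup(Map<String, FontDict> cache, FontDict fromResources) {
        if (cache == null) {
            return fromResources;
        }
        FontDict cached = cache.get(fromResources.baseName);
        if (cached == null) {
            cache.put(fromResources.baseName, fromResources);
            return fromResources;
        }
        return cached; // a later, wider definition is silently ignored
    }

    /** First definition: only the space character (code 32). */
    static FontDict narrowFont() {
        return new FontDict("F1", 32, 32, new float[] {278f});
    }

    /** Later definition under the same name: codes 32..65, so 'A' has a width. */
    static FontDict wideFont() {
        float[] widths = new float[34];
        widths[0] = 278f;  // space, illustrative value
        widths[33] = 667f; // 'A' (code 65), illustrative value
        return new FontDict("F1", 32, 65, widths);
    }

    /** Width of 'A' after both definitions were seen; cache may be null. */
    static float widthOfA(Map<String, FontDict> cache) {
        lookup(cache, narrowFont());
        return lookup(cache, wideFont()).widthOf('A');
    }

    public static void main(String[] args) {
        System.out.println("cached: " + widthOfA(new HashMap<>())); // prints "cached: 0.0"
        System.out.println("uncached: " + widthOfA(null));          // prints "uncached: 667.0"
    }
}
```

A cache that also compared firstChar, lastChar and the widths array would avoid this particular case, but as noted in the resolution comment, establishing that two fonts are fully equal is more involved than comparing a few fields.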
