Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2010/02/07 17:02:27 UTC

[jira] Updated: (PDFBOX-610) Fonts should not be cached by PDFStreamEngine

     [ https://issues.apache.org/jira/browse/PDFBOX-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-610:
--------------------------------------

    Fix Version/s:     (was: 0.8.0-incubator)

> Fonts should not be cached by PDFStreamEngine
> ---------------------------------------------
>
>                 Key: PDFBOX-610
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-610
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win or Linux
>            Reporter: Peter Costello
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> In org.apache.pdfbox.util.PDFStreamEngine, fonts are cached in the field
>    'private Map documentFontCache = new HashMap();'
> which is used in method 'processSubStream()' by the call 'sr.fonts = resources.getFonts(documentFontCache);'.
> The problem is that a PDF document can define a font with a limited 'firstChar' to 'lastChar' range (maybe just a space char) and then expand that range at a later point within the same page. When the font is cached, those updates are ignored.
> In particular, test page 1 of 'http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf'.
> With font caching, the widths of the characters in the upper right corner of the page are reported as zero, which compromises text extraction and text merging.
> Without font caching, the widths are correct. There are other examples that cause the same problem.
> To fix the problem, change the call in method 'processSubStream()' to:
>              sr.fonts = resources.getFonts(null);
> There was some effort put into font caching.  Unfortunately, it should not be used on unknown documents.
>  
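
The stale-cache effect described above can be illustrated with a small, self-contained sketch. This is not PDFBox code: the class FontCacheSketch, the type FontDef, and the methods resolve() and widthAfterUpdate() are hypothetical names made up for illustration, and keying the cache by font name is an assumption about how such a cache might behave. It only models the reported symptom, where a font first declared with a narrow firstChar/lastChar range is reused from a cache after the page has redefined it with a wider range.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the reported behavior; none of these types are PDFBox APIs.
class FontCacheSketch {

    /** Hypothetical stand-in for a font definition whose widths cover firstChar..lastChar. */
    static final class FontDef {
        final int firstChar;
        final int lastChar;
        final float[] widths;

        FontDef(int firstChar, int lastChar, float[] widths) {
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }

        /** Returns the width for a character code, or 0 if the code is outside the declared range. */
        float widthOf(int code) {
            if (code < firstChar || code > lastChar) {
                return 0f;
            }
            return widths[code - firstChar];
        }
    }

    /**
     * Resolves a font by name. With a non-null cache, the first definition seen
     * for a name wins (assumed cache behavior); with a null cache, the current
     * definition is always used.
     */
    static FontDef resolve(String name, FontDef current, Map<String, FontDef> cache) {
        if (cache == null) {
            return current;
        }
        FontDef cached = cache.get(name);
        if (cached == null) {
            cache.put(name, current);
            cached = current;
        }
        return cached;
    }

    /** Defines font "F1" twice, narrow range first, then asks for the width of 'A' (code 65). */
    static float widthAfterUpdate(Map<String, FontDef> cache) {
        // First definition: only the space character (code 32) has a width.
        FontDef early = new FontDef(32, 32, new float[] {250f});

        // Later definition on the same page: range expanded to 32..65.
        float[] expanded = new float[65 - 32 + 1];
        java.util.Arrays.fill(expanded, 500f);
        expanded[0] = 250f;
        FontDef later = new FontDef(32, 65, expanded);

        resolve("F1", early, cache);
        return resolve("F1", later, cache).widthOf(65);
    }

    public static void main(String[] args) {
        System.out.println("with cache:    " + widthAfterUpdate(new HashMap<>()));
        System.out.println("without cache: " + widthAfterUpdate(null));
    }
}
```

In this model, passing a map reproduces the zero width from the report, while passing null (analogous to the proposed 'resources.getFonts(null)') picks up the expanded range. Whether PDFBox's cache behaves exactly this way is not confirmed here.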

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.