Posted to dev@pdfbox.apache.org by "Peter Costello (JIRA)" <ji...@apache.org> on 2010/02/07 07:08:27 UTC

[jira] Created: (PDFBOX-610) Fonts should not be cached by PDFStreamEngine

Fonts should not be cached by PDFStreamEngine
---------------------------------------------

                 Key: PDFBOX-610
                 URL: https://issues.apache.org/jira/browse/PDFBOX-610
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 0.8.0-incubator
         Environment: Win or Linux
            Reporter: Peter Costello
             Fix For: 0.8.0-incubator


org.apache.pdfbox.util.PDFStreamEngine
   Fonts are cached using the variable 'private Map documentFontCache = new HashMap();',
   which is used in method 'processSubStream()' in the call 'sr.fonts = resources.getFonts(documentFontCache);'

The problem is that a PDF document can define a font with a limited 'firstChar' to 'lastChar' range (maybe just a space char), and then expand that range at a later point within the same page. When the font is cached, those updates are ignored.

In particular, test 'http://www.encana.com/investor/financial/shareholder/pdfs/info-circular-french.pdf', pg 1.
With font caching, the widths of the characters in the upper right corner of the page are reported as zero, and both text extraction and text merging are compromised.
Without font caching, the widths are correct. There are other examples that cause the same problem.

To fix the problem, change the call in method 'processSubStream()' to:
             sr.fonts = resources.getFonts(null);

Some effort was put into font caching. Unfortunately, it should not be used on unknown documents.
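
The failure mode can be reproduced outside PDFBox with a toy model. The sketch below is not PDFBox code: 'FontDef', 'resolve()' and the font name "F1" are hypothetical names made up for illustration. It only assumes that a cache keyed by font name keeps the first definition it sees, and that passing null skips the cache, which mirrors the 'getFonts(null)' change proposed above.

```java
import java.util.HashMap;
import java.util.Map;

public class FontCacheSketch {

    /** Hypothetical stand-in for one font definition: widths cover firstChar..lastChar only. */
    static final class FontDef {
        final int firstChar;
        final int lastChar;
        final float[] widths;

        FontDef(int firstChar, int lastChar, float[] widths) {
            this.firstChar = firstChar;
            this.lastChar = lastChar;
            this.widths = widths;
        }

        /** Codes outside the declared range have no width entry, so zero is reported. */
        float widthOf(int code) {
            if (code < firstChar || code > lastChar) {
                return 0f;
            }
            return widths[code - firstChar];
        }
    }

    /**
     * Resolves a font by name. With a non-null cache, the first definition seen
     * under that name wins; with a null cache, the current definition is used.
     */
    static FontDef resolve(String name, FontDef current, Map<String, FontDef> cache) {
        if (cache == null) {
            return current;
        }
        cache.putIfAbsent(name, current);
        return cache.get(name);
    }

    public static void main(String[] args) {
        // Early definition covers only the space character (code 32).
        FontDef early = new FontDef(32, 32, new float[] {250f});
        // Later definition under the same name expands the range to codes 32..34.
        FontDef later = new FontDef(32, 34, new float[] {250f, 333f, 408f});

        Map<String, FontDef> cache = new HashMap<>();
        resolve("F1", early, cache);                  // first use fills the cache
        FontDef stale = resolve("F1", later, cache);  // expanded definition is ignored
        FontDef fresh = resolve("F1", later, null);   // no cache: expanded definition is used

        System.out.println("cached width of code 33:   " + stale.widthOf(33));
        System.out.println("uncached width of code 33: " + fresh.widthOf(33));
    }
}
```

In this model the cached lookup reports a width of 0.0 for code 33 and the uncached lookup reports 333.0, which matches the zero widths described above.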
 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (PDFBOX-610) Fonts should not be cached by PDFStreamEngine

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/PDFBOX-610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-610:
--------------------------------------

    Fix Version/s:     (was: 0.8.0-incubator)

> Fonts should not be cached by PDFStreamEngine
> ---------------------------------------------
>
>                 Key: PDFBOX-610
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-610
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>         Environment: Win or Linux
>            Reporter: Peter Costello
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
