Posted to dev@pdfbox.apache.org by "Hamed Iravanchi (Created) (JIRA)" <ji...@apache.org> on 2012/01/30 17:06:11 UTC

[jira] [Created] (PDFBOX-1216) Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image

Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image
--------------------------------------------------------------------------------

Key: PDFBOX-1216
URL: https://issues.apache.org/jira/browse/PDFBOX-1216
Project: PDFBox
Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Hamed Iravanchi

When a PDF file contains Arabic / Farsi text, the letters appear disconnected after the pages are converted to images.
Arabic / Farsi letters are connected to each other when written.

Additionally, the error message "Changing font on <?> from <B Lotus> to the default font" appears on the console.
From my debugging, the cause appears to be that PDFBox looks in the embedded font for the "isolated" form of the character, whereas the embedded font only includes the "connected" form.
If the embedded font contains the isolated form too, the embedded font is used (the warning message doesn't appear for that character), but the character is displayed in the wrong form (i.e. isolated instead of connected).
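
To illustrate the isolated / connected distinction described above, here is a minimal sketch that uses only the Java standard library. It is not PDFBox code and makes no claim about how PDFBox 1.6.0 performs its lookup. The class name, the helper methods and the EMBEDDED_GLYPHS set are hypothetical; the set stands in for an embedded font that carries only the connected forms of one letter (BEH).

```java
import java.text.Normalizer;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class PresentationFormsDemo {
    // ARABIC LETTER BEH and its four Unicode presentation forms.
    static final char BEH = '\u0628';
    static final char BEH_ISOLATED = '\uFE8F';
    static final char BEH_FINAL = '\uFE90';
    static final char BEH_INITIAL = '\uFE91';
    static final char BEH_MEDIAL = '\uFE92';

    // Hypothetical stand-in for an embedded font that only contains
    // the connected forms actually used in the document.
    static final Set<Character> EMBEDDED_GLYPHS =
            new HashSet<>(Arrays.asList(BEH_FINAL, BEH_INITIAL, BEH_MEDIAL));

    // NFKC normalization folds every presentation form back to the base letter,
    // which discards the information about how the letter should connect.
    static String foldToBase(char c) {
        return Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFKC);
    }

    // Simplified glyph lookup against the hypothetical embedded font.
    static boolean fontHasGlyph(char c) {
        return EMBEDDED_GLYPHS.contains(c);
    }

    public static void main(String[] args) {
        char[] forms = {BEH_ISOLATED, BEH_FINAL, BEH_INITIAL, BEH_MEDIAL};
        for (char form : forms) {
            System.out.printf("U+%04X folds to U+%04X, present in embedded font: %b%n",
                    (int) form, (int) foldToBase(form).charAt(0), fontHasGlyph(form));
        }
        // A lookup keyed on the isolated form misses, although the glyph
        // the page actually needs (e.g. the initial form) is available.
        System.out.println("isolated form found: " + fontHasGlyph(BEH_ISOLATED));
        System.out.println("initial form found:  " + fontHasGlyph(BEH_INITIAL));
    }
}
```

Under these assumptions, a renderer that asks the font for the isolated form would either fail the lookup (and fall back to another font) or, if the isolated glyph happens to exist, draw the wrong shape, which matches the two symptoms reported here.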

This happens in both the 1.6.0 release and the latest trunk code (as of today). I didn't test previous versions.
The difference is that in 1.6.0 the default font (substituted as mentioned above) contains the Arabic / Persian characters, whereas in trunk the replaced characters are displayed as squares.

I will attach a PDF as an input for reproducing the issue.

Note: this might be related to issue PDFBOX-1127, but that one concerns text extraction.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Assigned] (PDFBOX-1216) Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image

Posted by "Andreas Lehmkühler (Assigned JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler reassigned PDFBOX-1216:
------------------------------------------

    Assignee: Andreas Lehmkühler
    

[jira] [Updated] (PDFBOX-1216) Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image

Posted by "Andreas Lehmkühler (Updated JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-1216:
---------------------------------------

    Attachment: PDFBOX1216-a1.png
    

[jira] [Resolved] (PDFBOX-1216) Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image

Posted by "Andreas Lehmkühler (Resolved JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-1216.
----------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.7.0

I fixed this issue in revision 1296818.

I implemented/improved the following:

- PDSimpleFont#drawString now uses glyphs for rendering
- unencoded values are used for the rendering of CID encoded fonts
- introduced a new value that indicates whether the AWT font was substituted

                

[jira] [Updated] (PDFBOX-1216) Arabic / Farsi (Persian) text appear disconnected when PDF is converted to image

Posted by "Hamed Iravanchi (Updated) (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hamed Iravanchi updated PDFBOX-1216:
------------------------------------

    Attachment: a.pdf
    
