Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/11/28 12:38:20 UTC

[jira] Resolved: (PDFBOX-571) Dubious handling of word spacing (Tw)

     [ https://issues.apache.org/jira/browse/PDFBOX-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler resolved PDFBOX-571.
---------------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0

Villu's explanation seems reasonable to me, so I've tested the patch and it works fine.

- the rendering of the attached sample PDF is more accurate (not perfect, but better)
- the extracted text of the attached sample PDF is more accurate too
- the other test cases work as before

Thanks to Villu for his contribution.

> Dubious handling of word spacing (Tw)
> -------------------------------------
>
>                 Key: PDFBOX-571
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-571
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction, Utilities
>    Affects Versions: 0.8.0-incubator
>            Reporter: Villu Ruusmann
>             Fix For: 1.0.0
>
>         Attachments: PDFStreamEngine.patch, pg_0005.pdf, pg_0005_selectall.png
>
>
> I wanted to provide a counterexample to the current handling of word spacing.
> The sample page (pg_0005.pdf) uses a Type1C font for text rendering. The problem is that this Type1C font uses a custom encoding in which code values are assigned sequentially, starting from code value 1. Thus code value 32 is assigned to the digit "3", not to the space character " " as one would expect.
> The PDF producer software has (mis-)used word spacing to break up longer character sequences. For example, on table line 3, the character sequence "0.831.05" is broken into two cells "0.83" and "1.05". Other uses of this "optimization" can be seen when the sample page is opened in Acrobat Reader (tested on version 7.0) and the "Select all" operation is performed. I've attached the screenshot of Acrobat Reader (pg_0005_selectall.png) for your convenience.
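
To illustrate the effect described in the report, here is a minimal Java sketch of how word spacing enters the glyph advance for a simple (single-byte) font. It follows the text-positioning formula in the PDF specification, tx = (w0 * Tfs + Tc + Tw) * Th, where Tw is added only after character code 32, whatever glyph that code maps to. It is not taken from PDFStreamEngine.patch or from PDFBox itself: the class and method names, the uniform glyph width of 0.5, and every code in the custom encoding except 32 -> "3" are assumptions made for the example.

```java
import java.util.Arrays;

class WordSpacingSketch {

    // Horizontal advance after one glyph of a simple (single-byte) font:
    // tx = (w0 * Tfs + Tc + Tw) * Th, with Tw applied only to code 32.
    static double advance(int code, double w0, double tfs,
                          double tc, double tw, double th) {
        // Keyed on the character code, not on the glyph it decodes to.
        double wordSpacing = (code == 32) ? tw : 0.0;
        return (w0 * tfs + tc + wordSpacing) * th;
    }

    // X offset at which each glyph of the string starts.
    static double[] startOffsets(int[] codes, double w0, double tfs,
                                 double tc, double tw, double th) {
        double[] xs = new double[codes.length];
        double x = 0.0;
        for (int i = 0; i < codes.length; i++) {
            xs[i] = x;
            x += advance(codes[i], w0, tfs, tc, tw, th);
        }
        return xs;
    }

    public static void main(String[] args) {
        // Hypothetical custom encoding for "0.831.05". Only 32 -> '3' comes
        // from the report; the other code assignments are invented.
        int[] codes = {29, 30, 31, 32, 33, 30, 29, 34};
        // Assumed values: glyph width 0.5, font size 10, Tc = 0, Tw = 3, Th = 1.
        double[] xs = startOffsets(codes, 0.5, 10.0, 0.0, 3.0, 1.0);
        System.out.println(Arrays.toString(xs));
    }
}
```

With these assumed numbers, the glyph after code 32 (the "1" of "1.05") starts 8 units after the "3" rather than the usual 5, which is the kind of gap that splits "0.83" from "1.05" on the sample page. An implementation that tested the decoded character for a space instead of testing the code would presumably miss this gap.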
