
Posted to dev@pdfbox.apache.org by "John Hewson (JIRA)" <ji...@apache.org> on 2014/09/24 20:31:35 UTC

[jira] [Comment Edited] (PDFBOX-2377) Apparent regression in character mapping in a few files from govdocs1

    [ https://issues.apache.org/jira/browse/PDFBOX-2377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146689#comment-14146689 ] 

John Hewson edited comment on PDFBOX-2377 at 9/24/14 6:31 PM:
--------------------------------------------------------------

Looking at the commits from PDFBOX-2247, the use of FirstChar is not correct. FirstChar is not involved in encoding at all; it is used only for retrieving glyph widths from the Widths array. The charOffset variable should be removed, and the "code" variable in getCharacter() should be left unmodified.

It looks like getFontWidth isn't using the Widths array at all, which might be the cause of the original problem? Or there may be deeper encoding issues with 1.8. None of this applies to the trunk anymore.
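
To illustrate the distinction, here is a minimal sketch of how FirstChar relates to the Widths array in a simple font. It is not PDFBox code: the class WidthLookupSketch and its fields and methods are hypothetical names chosen for this example, and the missing-width fallback assumes the usual PDF convention of using the font descriptor's MissingWidth for codes outside the FirstChar..LastChar range.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a simple font; not the PDFBox 1.8 classes.
public class WidthLookupSketch {
    private final int firstChar;          // FirstChar entry of the font dictionary
    private final float[] widths;         // Widths array; entry i belongs to code firstChar + i
    private final float missingWidth;     // assumed fallback (MissingWidth)
    private final Map<Integer, String> encoding = new HashMap<Integer, String>();

    public WidthLookupSketch(int firstChar, float[] widths, float missingWidth) {
        this.firstChar = firstChar;
        this.widths = widths;
        this.missingWidth = missingWidth;
    }

    // Registers a code -> text mapping, keyed by the raw character code.
    public void mapCode(int code, String text) {
        encoding.put(code, text);
    }

    // FirstChar is applied here, and only here: it turns a code into a Widths index.
    public float getWidth(int code) {
        int index = code - firstChar;
        if (index >= 0 && index < widths.length) {
            return widths[index];
        }
        return missingWidth;
    }

    // The encoding lookup uses the code as-is; no FirstChar offset is applied.
    public String getCharacter(int code) {
        return encoding.get(code);
    }

    public static void main(String[] args) {
        WidthLookupSketch font = new WidthLookupSketch(32, new float[] {250f, 333f, 408f}, 0f);
        font.mapCode(33, "!");
        System.out.println("width of code 33: " + font.getWidth(33));
        System.out.println("text for code 33: " + font.getCharacter(33));
    }
}
```

In this sketch, shifting the code by FirstChar before the encoding lookup would map each byte to the wrong character, which would be consistent with the kind of scrambled output reported in this issue.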


was (Author: jahewson):
Looking at the commits from PDFBOX-2247, the use of FirstChar is not correct. FirstChar is not involved in encoding at all, it is used only for retrieving glyph widths from the Widths array. The charOffset variable should be removed, and the code in getCharacter() should be left unmodified.

It looks like getFontWidth isn't using the Widths array at all, which might be the cause of the original problem? Or there may be deeper encoding issues with 1.8. None of this applies to the trunk anymore.

> Apparent regression in character mapping in a few files from govdocs1
> ---------------------------------------------------------------------
>
>                 Key: PDFBOX-2377
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2377
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.7
>            Reporter: Tim Allison
>            Priority: Minor
>              Labels: regression
>         Attachments: 312888.pdf, 764929.pdf
>
>
> On a small number of test files in a 50k sample of PDFs from govdocs1, it appears that some characters are no longer being extracted correctly in 1.8.7 when compared to 1.8.6. I ran PDFBox's app.jar with ExtractText:
> {noformat}
> 764949.pdf
> 1.8.6: Lang, Astrophysical Data: Planets and Stars
> 1.8.7: Lang, AefdaphyeiUSl DSfS: PlSnefe Snd EfSde,
> {noformat}
> and
> {noformat}
> 312888.pdf
> 1.8.6: Self-Assessment & Capability Description
> 1.8.7: Seff-Ammemmmehn & Cajabcfcns Demclcjncih
> {noformat}


