Posted to dev@pdfbox.apache.org by "Pavel Misurkin (JIRA)" <ji...@apache.org> on 2014/12/25 12:07:13 UTC

[jira] [Updated] (PDFBOX-2584) Text extraction reports zero character widths

     [ https://issues.apache.org/jira/browse/PDFBOX-2584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavel Misurkin updated PDFBOX-2584:
-----------------------------------
    Attachment: stip_2c.pdf

Sample file demonstrating the text extraction problem.

> Text extraction reports zero character widths 
> ----------------------------------------------
>
>                 Key: PDFBOX-2584
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2584
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.8.8, 2.0.0
>            Reporter: Pavel Misurkin
>         Attachments: stip_2c.pdf
>
>
> We are using the PDFBox API to get the positions of characters within a document.
> We have found a problem with one document: text extraction extracts the text correctly but sets every character's width to zero.
> The code is pretty simple:
> {code}
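> // Load the sample PDF attached to this issue
> File input = new File("stip_2c.pdf");
> PDDocument document = PDDocument.load(input);
>
> // Run text extraction; the extracted text is written to an in-memory buffer
> PDFTextStripper extractor = new PDFTextStripper();
> Writer output = new StringWriter();
> extractor.writeText(document, output);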
> {code}
> We then examine the extractor's charactersByArticle member for the character widths.
> - In 1.8.4, where we first found the issue, all character widths were zero.
> - In 1.8.8, all character widths were zero except for whitespace.
> See the new validation added in 1.8.8, in the file
> pdfbox-1.8.8-src\pdfbox\src\main\java\org\apache\pdfbox\util\PDFStreamEngine.java
> at line 369:
> {code}
> if (spaceWidthText == 0)
> {
>     spaceWidthText = 1.0f; // if could not find font, use a generic value
> }
> {code}
> - In version 2.0.0 the problem still exists.
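
For readers without the PDFBox sources at hand, the check quoted from PDFStreamEngine.java follows a common fallback pattern: a space width of zero is replaced with a generic value. The following is a minimal stand-alone sketch of that pattern only. The class and method names are hypothetical and are not part of PDFBox, and whether this check is what makes whitespace widths non-zero in 1.8.8 is the reporter's suggestion, not something the sketch confirms.

```java
// Hypothetical stand-alone sketch of the fallback pattern; this is NOT PDFBox source code.
final class SpaceWidthFallback {

    // Generic width used when no usable space width is available
    // (the value 1.0f is taken from the snippet quoted in the report).
    static final float GENERIC_SPACE_WIDTH = 1.0f;

    // Returns the reported width unless it is zero, in which case the generic value is used.
    static float effectiveSpaceWidth(float reportedSpaceWidth) {
        return reportedSpaceWidth == 0 ? GENERIC_SPACE_WIDTH : reportedSpaceWidth;
    }

    public static void main(String[] args) {
        // A zero width triggers the fallback; a non-zero width passes through unchanged.
        System.out.println(effectiveSpaceWidth(0f));
        System.out.println(effectiveSpaceWidth(0.25f));
    }
}
```

Note that a fallback of this kind only touches the space width. It would not by itself change the zero widths reported for the other characters, which is consistent with the behaviour described for 1.8.8.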



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)