Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2014/10/23 19:19:34 UTC

[jira] [Closed] (PDFBOX-1222) PDFs created with idealsoftware.com's VPE are all wrong

     [ https://issues.apache.org/jira/browse/PDFBOX-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler closed PDFBOX-1222.
--------------------------------------
       Resolution: Fixed
    Fix Version/s: 1.7.0
         Assignee: Andreas Lehmkühler

Text extraction has worked correctly since PDFBox 1.7.0. The "Comparison method violates its general contract!" exception also no longer occurs as of 1.7.0.
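
For readers unfamiliar with that exception: one common way a comparator ends up breaking the contract is by treating values within some tolerance as equal, which is not transitive. The sketch below only illustrates that general failure mode; the class name, the FUZZY comparator and the TOLERANCE value are hypothetical and are not taken from PDFBox's text position comparator.

```java
import java.util.Comparator;

public class ToleranceComparatorDemo {
    // Hypothetical tolerance: values closer than this are treated as equal.
    static final double TOLERANCE = 1.0;

    // Hypothetical comparator, NOT PDFBox code: "equal" within TOLERANCE.
    static final Comparator<Double> FUZZY = (a, b) -> {
        if (Math.abs(a - b) < TOLERANCE) {
            return 0;
        }
        return a < b ? -1 : 1;
    };

    // True when x "equals" y and y "equals" z, yet x does not "equal" z.
    // Such a triple violates the contract documented for Comparator.compare.
    static boolean violatesTransitivity(Comparator<Double> c,
                                        double x, double y, double z) {
        return c.compare(x, y) == 0
                && c.compare(y, z) == 0
                && c.compare(x, z) != 0;
    }

    public static void main(String[] args) {
        // 0.0 ~ 0.6 and 0.6 ~ 1.2 under the tolerance, but 0.0 < 1.2.
        System.out.println(violatesTransitivity(FUZZY, 0.0, 0.6, 1.2));
    }
}
```

The sort implementation used since JDK 7 can detect such inconsistencies while merging and then throws IllegalArgumentException, whereas earlier JDKs silently produced some ordering.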


> PDFs created with idealsoftware.com's VPE are all wrong
> -------------------------------------------------------
>
>                 Key: PDFBOX-1222
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1222
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.6.0
>            Reporter: Radek
>            Assignee: Andreas Lehmkühler
>             Fix For: 1.7.0
>
>         Attachments: rtf.pdf
>
>
> Follow the steps:
> 1. Download the attached example PDF. It is the same as the "example rich text format" PDF from idealsoftware.com, but with text extraction protection disabled.
> 2a. java -jar pdfbox-app-1.6.0.jar ExtractText -sort rtf.pdf extr.txt
> Actual results:
> Text is all gibberish. On close inspection, sorting "reads" the text vertically: the first characters of each line come first, then the second characters of each line, etc.
> Moreover, on JDK 7: java.lang.IllegalArgumentException: Comparison method violates its general contract! (thrown by the text position sorting comparator)
> Poking around the code indicates that the sorting would be correct *if* the character rotation were 270 degrees. It is (correctly?) calculated as zero instead.
> 2b. java -jar pdfbox-app-1.6.0.jar ExtractText rtf.pdf extr.txt
> Actual results:
> Text is fine, but each page is glued into a single line. Poking around the code indicates that character offsets go down correctly, but the expected line height is huge (full page height or width?), so the offsets never go down far enough to trigger newline detection.
> So there is something very wrong with the character positions in those files, which prevents PDFBox from extracting the text correctly.
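
The newline symptom in step 2b can be pictured with a minimal sketch. It assumes a simplified rule in which a new line starts once the vertical offset between consecutive characters exceeds a fraction of the expected line height. The class name, the method name and the 0.5 factor are hypothetical and do not reproduce PDFBox's actual logic.

```java
public class LineBreakSketch {
    // Hypothetical rule, NOT PDFBox code: a new line starts when the vertical
    // position changes by more than half of the expected line height.
    static boolean startsNewLine(float previousY, float currentY,
                                 float expectedLineHeight) {
        return Math.abs(currentY - previousY) > 0.5f * expectedLineHeight;
    }

    public static void main(String[] args) {
        // Plausible line height (12 units): a 12-unit drop is a new line.
        System.out.println(startsNewLine(700f, 688f, 12f));
        // Line height mistaken for the page height (842 units): the same
        // drop is far too small, so the whole page stays on one line.
        System.out.println(startsNewLine(700f, 688f, 842f));
    }
}
```

Under this assumed rule, an expected line height close to the page size means no realistic drop between characters can ever cross the threshold, which matches the "each page is glued into a single line" observation.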



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)