Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/05/10 13:55:45 UTC

[jira] Issue Comment Edited: (PDFBOX-358) Vertical text extraction splitting text

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707780#action_12707780 ] 

Andreas Lehmkühler edited comment on PDFBOX-358 at 5/10/09 4:55 AM:
--------------------------------------------------------------------

Hi Daniel,

I see the same effect when converting flyer2.pdf and mtxFidelity.pdf from PDFBOX-51. The problem was the AffineTransform that was used to rotate the page. I've replaced that code in version 773325, and converting now works for both documents.
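
For readers unfamiliar with the transform in question, here is a minimal sketch of how a java.awt.geom.AffineTransform can map coordinates from an unrotated page into the space of a page displayed rotated 90° clockwise. It is not the PDFBox code from that version; the class name PageRotationSketch, the method toRotatedSpace and the page width of 612 are assumptions made purely for illustration.

```java
import java.awt.geom.AffineTransform;
import java.awt.geom.Point2D;

// Hypothetical illustration only; this is not part of the PDFBox API.
final class PageRotationSketch {

    // Maps a point given in unrotated page coordinates (origin bottom-left,
    // y pointing up) into the coordinates of the same page displayed
    // rotated 90 degrees clockwise. pageWidth is the unrotated page width.
    static Point2D toRotatedSpace(double x, double y, double pageWidth) {
        AffineTransform at = new AffineTransform();
        // Transforms concatenated later are applied to the point first:
        // the quadrant rotation runs first, then the translation.
        at.translate(0, pageWidth);   // shift the rotated page back into the positive quadrant
        at.quadrantRotate(-1);        // exact -90 degree rotation: (x, y) -> (y, -x)
        return at.transform(new Point2D.Double(x, y), null);
    }

    public static void main(String[] args) {
        // 612 is an assumed page width (US Letter, in points).
        System.out.println(toRotatedSpace(0, 0, 612));
        System.out.println(toRotatedSpace(100, 700, 612));
    }
}
```

With these assumptions a point (x, y) ends up at (y, pageWidth - x), so the bottom-left corner of the unrotated page lands at the top-left corner of the rotated one. Getting the order of translate and rotate wrong is an easy mistake with this API, and it shifts every text position on the page.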

  
> Vertical text extraction splitting text
> ---------------------------------------
>
>                 Key: PDFBOX-358
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-358
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Jukka Zitting
>             Fix For: 0.8.0-incubator
>
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1981851&group_id=78314&atid=552832
> Vertical text gets split during extraction using PDFTextStripper.
> "Spécifications" gives:
> Spécif
> ic
> ations
> This is made worse when sorted by position, as it gets mixed up with the
> horizontal text:
> ic
> ations
> [CLASSIFIED INFO]
> [CLASSIFIED INFO]
> Spécif [CLASSIFIED INFO]
> [CLASSIFIED INFO]
> I'm afraid I can't provide the PDF in question due to confidentiality
> requirements. It's a PDF obtained from the conversion to PDF of a Windows
> Word document. According to the forums I'm not the only one with this
> problem.
> [Comment on SourceForge]
> Date: 2008-06-02 09:11
> Sender: totoll
> Logged In: YES 
> user_id=2096423
> Originator: YES
> To clarify, the text in question is rotated by 90° counter-clockwise.
> [Comment on SourceForge]
> Date: 2008-06-02 10:30
> Sender: totoll
> Logged In: YES 
> user_id=2096423
> Originator: YES
> I have attached an admittedly very complicated PDF document which (as far
> as I can tell) features 90° and 135° rotated text in a 90° rotated page.
> Position-ordered text extraction gives horrible results. 
> Normal text extraction is also very messy, although in this second case
> the results are almost understandable. 
> This is not the document I need to process, but I think that if text can be
> correctly extracted from that PDF, it should work for almost every other
> existing PDF.
> File Added: Flyer2.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=279847&aid=1981851

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.