Posted to dev@pdfbox.apache.org by "Andreas Lehmkühler (JIRA)" <ji...@apache.org> on 2009/01/06 19:32:44 UTC
[jira] Commented: (PDFBOX-358) Vertical text extraction splitting text

    [ https://issues.apache.org/jira/browse/PDFBOX-358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661242#action_12661242 ] 

Andreas Lehmkühler commented on PDFBOX-358:
-------------------------------------------

Version 732038 contains a patch that fixes some display issues when the rotation angle is not a multiple of 90 degrees.
I'll look at the text-stripping part later.

> Vertical text extraction splitting text
> ---------------------------------------
>
>                 Key: PDFBOX-358
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-358
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Jukka Zitting
>
> [Issue from SourceForge]
> http://sourceforge.net/tracker/index.php?func=detail&aid=1981851&group_id=78314&atid=552832
> Vertical text gets split during extraction using PDFTextStripper.
> "Specification" gives:
> Spécif
> ic
> ations
> This is made worse when sorted by position, as it gets mixed up with the
> horizontal text:
> ic
> ations
> [CLASSIFIED INFO]
> [CLASSIFIED INFO]
> Spécif [CLASSIFIED INFO]
> [CLASSIFIED INFO]
> I'm afraid I can't provide the PDF in question due to confidentiality
> requirements. It's a PDF obtained from the conversion to PDF of a Windows
> Word document. According to the forums I'm not the only one with this
> problem.
> [Comment on SourceForge]
> Date: 2008-06-02 09:11
> Sender: totoll
> Logged In: YES 
> user_id=2096423
> Originator: YES
> To clarify, the text in question is rotated by 90° counter-clockwise.
> [Comment on SourceForge]
> Date: 2008-06-02 10:30
> Sender: totoll
> Logged In: YES 
> user_id=2096423
> Originator: YES
> I have attached an admittedly very complicated PDF document which (as far
> as I can tell) features 90° and 135° rotated text in a 90° rotated page.
> Position-ordered text extraction gives horrible results. 
> Normal text extraction is also very messy, although in this second case
> the results are almost understandable. 
> This is not the document I need to process, but I think that if text can be
> correctly extracted from that PDF, it should work for almost every other
> existing PDF.
> File Added: Flyer2.pdf
> http://sourceforge.net/tracker/download.php?group_id=78314&atid=552832&file_id=279847&aid=1981851
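
The mix-up described in the report can be reproduced with a small, self-contained sketch. It is not PDFBox code and not the patch mentioned above: the Glyph class, the sample coordinates and all method names are hypothetical, and it assumes that each glyph's text direction is already known. Sorting every glyph by page position interleaves the 90° counter-clockwise word with the horizontal lines, whereas grouping glyphs by direction and rotating each group back into its own reading frame before sorting keeps the word together.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names are part of the PDFBox API.
class RotatedTextSketch {

    /** One glyph: character, page position (y grows upward), text direction in degrees. */
    static final class Glyph {
        final char ch;
        final double x, y;
        final int direction; // 0 = horizontal, 90 = rotated counter-clockwise

        Glyph(char ch, double x, double y, int direction) {
            this.ch = ch; this.x = x; this.y = y; this.direction = direction;
        }
    }

    /** Reads glyphs as if their common text direction were the given angle. */
    static String readInFrame(List<Glyph> glyphs, int direction) {
        double rad = Math.toRadians(direction);
        double cos = Math.cos(rad), sin = Math.sin(rad);
        List<Glyph> rotated = new ArrayList<>();
        for (Glyph g : glyphs) {
            // Rotate by -direction so the reading direction becomes +x.
            // Rounding absorbs floating-point noise and buckets glyphs into lines.
            long along = Math.round(g.x * cos + g.y * sin);
            long line = Math.round(-g.x * sin + g.y * cos);
            rotated.add(new Glyph(g.ch, along, line, 0));
        }
        // Top line first (larger y), then left to right within a line.
        rotated.sort(Comparator.comparingDouble((Glyph g) -> g.y).reversed()
                .thenComparingDouble(g -> g.x));
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < rotated.size(); i++) {
            if (i > 0 && rotated.get(i).y != rotated.get(i - 1).y) {
                out.append('\n'); // line position changed: start a new line
            }
            out.append(rotated.get(i).ch);
        }
        return out.toString();
    }

    /** Naive position sort: treats every glyph as horizontal text. */
    static String extractNaive(List<Glyph> glyphs) {
        return readInFrame(glyphs, 0);
    }

    /** Groups glyphs by direction and reads each group in its own frame. */
    static String extractGrouped(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> byDirection = new TreeMap<>();
        for (Glyph g : glyphs) {
            byDirection.computeIfAbsent(g.direction, d -> new ArrayList<>()).add(g);
        }
        List<String> blocks = new ArrayList<>();
        for (Map.Entry<Integer, List<Glyph>> e : byDirection.entrySet()) {
            blocks.add(readInFrame(e.getValue(), e.getKey()));
        }
        return String.join("\n", blocks);
    }

    /** Made-up page: vertical "Spec" at x=10 next to two horizontal lines. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        String vertical = "Spec"; // reads upward because it is rotated 90° CCW
        for (int i = 0; i < vertical.length(); i++) {
            glyphs.add(new Glyph(vertical.charAt(i), 10, 100 + 10 * i, 90));
        }
        glyphs.add(new Glyph('A', 50, 120, 0));
        glyphs.add(new Glyph('B', 60, 120, 0));
        glyphs.add(new Glyph('C', 50, 100, 0));
        glyphs.add(new Glyph('D', 60, 100, 0));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(extractNaive(sample()));
        System.out.println("----");
        System.out.println(extractGrouped(sample()));
    }
}
```

On the sample page the naive sort yields the lines "c", "eAB", "p", "SCD", while the grouped variant yields "AB", "CD", "Spec". A real extractor would additionally need to derive the direction from each glyph's text matrix and tolerate small baseline offsets; both are left out of this sketch.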

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.