Posted to dev@pdfbox.apache.org by "Sandor Dj (JIRA)" <ji...@apache.org> on 2010/08/24 10:24:16 UTC

[jira] Issue Comment Edited: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901768#action_12901768 ] 

Sandor Dj edited comment on PDFBOX-800 at 8/24/10 4:22 AM:
-----------------------------------------------------------

As you can see, there are some vertical textboxes in the middle of the page (PDF file).
In the office document from which the PDF file was created, there are NO line breaks.
But the text extraction returns a separate string for each letter.
Is it possible to avoid this?

Hope my problem is now comprehensible :)

      was (Author: sandor1990):
    As you can see, there are some vertical textboxes in the middle of the page (PDF file).
In the office document from which the PDF file was created, there are NO line breaks.
But the text extraction returns a separate string for each letter.
Is it possible to avoid this?

Hope my problem is now comprehensible :)
  
> Wrong text extract from vertical textboxes in pdf files
> -------------------------------------------------------
>
>                 Key: PDFBOX-800
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
>             Project: PDFBox
>          Issue Type: Bug
>         Environment: Win 7, VS 2010 C#
>            Reporter: Sandor Dj
>         Attachments: problemdoc.doc, problemdoc.pdf
>
>
> I was told to move this issue to the pdfbox parser, so I hope this is the right section.
> Vertical textboxes in PDF files are not extracted correctly (using the Tika library in C#).
> For example, if there is a vertical textbox "hello" in a PDF file (!WITHOUT! line breaks):
> H
> E
> L
> L
> O
> the parser returns 5 strings, each with a single letter, even though there is NO line break after any letter.
> Is there an option to avoid this problem?
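
For illustration only, the symptom described above can be worked around after extraction by joining runs of single-character lines. The sketch below is a plain string post-processing heuristic in Java; it is not a PDFBox or Tika option, and the class and method names (VerticalTextMerger, mergeSingleCharLines) are hypothetical. It assumes the extracted text is already available as one string with one letter per line.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or Tika.
final class VerticalTextMerger {

    private VerticalTextMerger() {}

    // Joins every run of consecutive single-character lines into one line.
    // All other lines (including blank lines) are kept and end the current run.
    static String mergeSingleCharLines(String extracted) {
        List<String> out = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String line : extracted.split("\\R")) {
            String trimmed = line.trim();
            if (trimmed.codePointCount(0, trimmed.length()) == 1) {
                run.append(trimmed);   // still inside a vertical run
            } else {
                flush(run, out);       // the run ended before this line
                out.add(line);
            }
        }
        flush(run, out);               // a run may reach the end of the text
        return String.join("\n", out);
    }

    // Emits the collected run as one line, if there is one.
    private static void flush(StringBuilder run, List<String> out) {
        if (run.length() > 0) {
            out.add(run.toString());
            run.setLength(0);
        }
    }
}
```

Calling VerticalTextMerger.mergeSingleCharLines("H\nE\nL\nL\nO\n") returns "HELLO". The heuristic cannot tell a vertical textbox from lines that genuinely hold one character each, so it would merge those as well; it only hides the symptom and does not change how the parser handles the textbox.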
