Posted to dev@pdfbox.apache.org by "Sandor Dj (JIRA)" <ji...@apache.org> on 2010/08/24 08:40:18 UTC

[jira] Created: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Wrong text extract from vertical textboxes in pdf files
-------------------------------------------------------

                 Key: PDFBOX-800
                 URL: https://issues.apache.org/jira/browse/PDFBOX-800
             Project: PDFBox
          Issue Type: Bug
         Environment: Win 7, VS 2010 C#
            Reporter: Sandor Dj
            Priority: Critical


I was told to move this issue to the pdfbox parser, so I hope this is the right section.
Vertical textboxes in pdf files are not extracted correctly (using the tika library in C#).
For example if there is a vertical textbox "hello" in a pdf file (!WITHOUT! line breaks):
H
E
L
L
O
the parser returns 5 strings, each with a single letter, even though there is NO line break after each letter.
Is there an option to avoid this problem?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901763#action_12901763 ] 

Andreas Lehmkühler commented on PDFBOX-800:
-------------------------------------------

Please attach a sample document to this issue if possible.


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905430#action_12905430 ] 

Sandor Dj commented on PDFBOX-800:
----------------------------------

I'm using Tika and the AutoDetectParser to extract text, so PDFTextStripper is not in use, as shown in the following example: http://blogs.dovetailsoftware.com/blogs/kmiller/archive/2010/07/02/using-the-tika-java-library-in-your-net-application-with-ikvm.aspx

Any other suggestions :\ ?


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905080#action_12905080 ] 

Jukka Zitting commented on PDFBOX-800:
--------------------------------------

Setting the sortByPosition option on the PDFTextStripper should make the "Hallo das ist ein vertikales TEXTFELD" box get correctly extracted.
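
For readers unfamiliar with the option: in PDFBox it is switched on with PDFTextStripper.setSortByPosition(true). Without it, text comes out in the order it appears in the content stream; with it, characters are ordered by their position on the page first. The following is only a stdlib-only sketch of that idea, not PDFBox's actual implementation. The Glyph class, the coordinates and the sample data are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortByPositionSketch {
    /** Hypothetical glyph record: one character and its page position (y grows downward). */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;

        Glyph(char ch, double x, double y) {
            this.ch = ch;
            this.x = x;
            this.y = y;
        }
    }

    /** Concatenates glyphs in the order given, i.e. content-stream order. */
    static String inStreamOrder(List<Glyph> glyphs) {
        StringBuilder sb = new StringBuilder();
        for (Glyph g : glyphs) {
            sb.append(g.ch);
        }
        return sb.toString();
    }

    /** Orders glyphs top-to-bottom, then left-to-right, before concatenating. */
    static String sortedByPosition(List<Glyph> glyphs) {
        List<Glyph> copy = new ArrayList<>(glyphs);
        copy.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        return inStreamOrder(copy);
    }

    /** One line of text whose characters were written to the stream out of order. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('X', 20, 100));
        glyphs.add(new Glyph('T', 30, 100));
        glyphs.add(new Glyph('T', 0, 100));
        glyphs.add(new Glyph('E', 10, 100));
        return glyphs;
    }

    public static void main(String[] args) {
        System.out.println(inStreamOrder(sample()));
        System.out.println(sortedByPosition(sample()));
    }
}
```

The sketch compares y values exactly; a real extractor would need a tolerance so that glyphs with slightly different baselines still count as one line.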


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905874#action_12905874 ] 

Jukka Zitting commented on PDFBOX-800:
--------------------------------------

We might consider enabling the sortByPosition option by default in PDFBox, as the performance impact isn't too bad (around 5% on some documents I tested with). Alternatively you can file an improvement request for Tika to explicitly set the option in its PDFParser class.


[jira] Issue Comment Edited: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901768#action_12901768 ] 

Sandor Dj edited comment on PDFBOX-800 at 8/24/10 4:22 AM:
-----------------------------------------------------------

As you can see, there are some vertical textboxes in the middle of the page (pdf file).
In the office document from which the pdf file was created, there are NO line breaks.
But the text extraction returns single strings, one for each letter.
Is it possible to avoid this?

Hope my problem is now comprehensible :)


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904550#action_12904550 ] 

Sandor Dj commented on PDFBOX-800:
----------------------------------

Okay, I see the problem.
But what about the textbox "Hallo das ist ein vertikales TEXTFELD" (the first vertical one on the left side)? Why is this one not extracted correctly? The font is rotated by 90°...
We have some other PDF files with similar textboxes and they are also extracted incorrectly.


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903435#action_12903435 ] 

Jukka Zitting commented on PDFBOX-800:
--------------------------------------

One possible approach would be to divide the characters on a page to different "layers" depending on the orientation in which they are drawn. One layer would contain only horizontal characters, while others would contain vertical and diagonal ones. With appropriate rotation we could then apply the normal horizontal text extraction algorithm also for the vertically and diagonally drawn characters.
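
A rough sketch of how such a layering could look, under stated assumptions: nothing here is PDFBox code, and the Glyph class, the integer angle field and the sample coordinates are hypothetical. Glyphs are grouped by the angle at which they are drawn, each layer is rotated back so that its writing direction runs left to right, and a simplified stand-in for the horizontal extraction (a plain sort, with line breaks omitted) is applied per layer. Coordinates follow the PDF convention of y growing upward.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class OrientationLayersSketch {
    /** Hypothetical glyph: character, position (y grows upward), rotation in degrees. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        final int angle;

        Glyph(char ch, double x, double y, int angle) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.angle = angle;
        }
    }

    /** Rotates a glyph position by -angle so its writing direction becomes +x. */
    static Glyph unrotate(Glyph g) {
        double r = Math.toRadians(g.angle);
        double x = g.x * Math.cos(r) + g.y * Math.sin(r);
        double y = -g.x * Math.sin(r) + g.y * Math.cos(r);
        return new Glyph(g.ch, x, y, 0);
    }

    /** Returns one extracted string per orientation layer, keyed by angle. */
    static Map<Integer, String> extractByLayer(List<Glyph> glyphs) {
        Map<Integer, List<Glyph>> layers = new TreeMap<>();
        for (Glyph g : glyphs) {
            layers.computeIfAbsent(g.angle, k -> new ArrayList<>()).add(unrotate(g));
        }
        Map<Integer, String> out = new TreeMap<>();
        for (Map.Entry<Integer, List<Glyph>> e : layers.entrySet()) {
            List<Glyph> layer = e.getValue();
            // Top line first (largest y), then left to right. y is rounded so that
            // floating-point noise from the rotation does not split a line.
            layer.sort(Comparator.comparingLong((Glyph g) -> -Math.round(g.y))
                    .thenComparingDouble(g -> g.x));
            StringBuilder sb = new StringBuilder();
            for (Glyph g : layer) {
                sb.append(g.ch);
            }
            out.put(e.getKey(), sb.toString());
        }
        return out;
    }

    /** "ab" drawn horizontally and "cd" drawn rotated by 90 degrees, interleaved. */
    static List<Glyph> sample() {
        List<Glyph> glyphs = new ArrayList<>();
        glyphs.add(new Glyph('d', 50, 20, 90));
        glyphs.add(new Glyph('a', 10, 100, 0));
        glyphs.add(new Glyph('c', 50, 10, 90));
        glyphs.add(new Glyph('b', 20, 100, 0));
        return glyphs;
    }
}
```

With the sample data, layer 0 yields "ab" and layer 90 yields "cd", even though the glyphs of the two boxes are interleaved in the input.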


[jira] Updated: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandor Dj updated PDFBOX-800:
-----------------------------

    Environment: Windows 7, VS 2010 C#  (was: Win 7, VS 2010 C#)


[jira] Updated: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandor Dj updated PDFBOX-800:
-----------------------------

    Environment: Windows 7, VS 2010 C#, Tika Library  (was: Windows 7, VS 2010 C#)


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903439#action_12903439 ] 

Jukka Zitting commented on PDFBOX-800:
--------------------------------------

Hmm, actually PDFBox already does properly extract the "Hallo das ist ein anderes vertikales TEXTFELD" and "Hallo das ist ein horizontales TEXTFELD" sentences from the example document.

Handling the vertical "Hallo" text boxes where the characters are horizontally oriented is probably impossible unless there's some external hint that the text should be treated like vertical writing in Chinese or Japanese.


[jira] Updated: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Andreas Lehmkühler (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andreas Lehmkühler updated PDFBOX-800:
--------------------------------------

    Priority: Major  (was: Critical)


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903440#action_12903440 ] 

Mel Martinez commented on PDFBOX-800:
-------------------------------------

I guess you could do it if you used a vertical text field but with a font that rendered each character lying on its side ... !


[jira] Issue Comment Edited: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903440#action_12903440 ] 

Mel Martinez edited comment on PDFBOX-800 at 8/27/10 11:00 AM:
---------------------------------------------------------------

I guess you could do it if you used a vertical text field but with a font that rendered each character lying on its side ... !

I.e., a font that rendered a capital 'E' so that it looked like a 'W'.


[jira] Updated: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandor Dj updated PDFBOX-800:
-----------------------------

    Attachment: problemdoc.pdf
                problemdoc.doc

As you can see, there are some vertical textboxes in the middle of the page (pdf file).
In the office document from which the pdf file was created, there are NO line breaks.
But the text extraction returns single strings, one for each letter.
Is it possible to avoid this?

Hope my problem is now comprehensible :)


[jira] Updated: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Sandor Dj (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sandor Dj updated PDFBOX-800:
-----------------------------

    Component/s: Parsing


[jira] Commented: (PDFBOX-800) Wrong text extract from vertical textboxes in pdf files

Posted by "Mel Martinez (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/PDFBOX-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903430#action_12903430 ] 

Mel Martinez commented on PDFBOX-800:
-------------------------------------

This is a tricky problem to try to resolve.

This occurs because, fundamentally, PDF is not a structured data format. It is a page rendering format. That means that the letters:

H
E
L
L
O

may not be stored within the PDF as an integral character sequence (and probably are not). Instead they exist as commands to 'render' each character on the page in the desired location and with the specified attributes (size, color, font, etc.).

The fact that they are not separated by a carriage return when entered into the creation of the document doesn't really mean anything as PDF doesn't really have the concept of carriage returns.   In the PDF, text starts on the next line down by the fact that the next text object is to be rendered at the coordinates that _look_ like a carriage return is there.

When PDFBox 'extracts' text, all it is really doing is _rendering_ the PDF to a text file. So it tries to guess, based on the character coordinates (and their proximity to other characters being rendered), when to insert white space control characters such as spaces and carriage returns.

Basically, it is 'drawing' each page using the limitation that the only drawing tool is plain characters!
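
The guessing described above can be sketched roughly as follows. This is a simplified illustration and not the PDFTextStripper code: the Glyph class, the LINE_TOLERANCE and SPACE_GAP thresholds and all coordinates are assumptions. A large vertical jump between consecutive glyphs becomes a line break, and a large horizontal gap on the same line becomes a space.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class RenderToTextSketch {
    /** Hypothetical glyph: character, left edge x, distance y from the page top, width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double y;
        final double width;

        Glyph(char ch, double x, double y, double width) {
            this.ch = ch;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Assumed thresholds, chosen only for this example.
    static final double LINE_TOLERANCE = 2.0;
    static final double SPACE_GAP = 3.0;

    /** "Draws" the glyphs as plain text, guessing line breaks and spaces from geometry. */
    static String render(List<Glyph> glyphs) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)
                .thenComparingDouble(g -> g.x));
        StringBuilder sb = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : sorted) {
            if (prev != null) {
                if (Math.abs(g.y - prev.y) > LINE_TOLERANCE) {
                    sb.append('\n'); // glyph sits on a different line
                } else if (g.x - (prev.x + prev.width) > SPACE_GAP) {
                    sb.append(' '); // visible gap on the same line
                }
            }
            sb.append(g.ch);
            prev = g;
        }
        return sb.toString();
    }

    /** Places the characters of a word on one line, left to right, without gaps. */
    static List<Glyph> horizontal(String word, double startX, double y) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            glyphs.add(new Glyph(word.charAt(i), startX + 6.0 * i, y, 6.0));
        }
        return glyphs;
    }

    /** Places the characters of a word below each other, one per line. */
    static List<Glyph> vertical(String word, double x, double startY) {
        List<Glyph> glyphs = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            glyphs.add(new Glyph(word.charAt(i), x, startY + 10.0 * i, 6.0));
        }
        return glyphs;
    }
}
```

Under these assumptions, render(vertical("HELLO", 100, 30)) puts each letter on its own line, because every glyph sits well below the previous one; the geometry alone carries no hint that the five letters form a single word.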

So, a piece of vertical text like you have here is tricky because there is no inherent way for PDFBox to know for certain that the characters are meant to be contiguous within a single word. For example, one could also have a page with the characters:

1
2
3
4
...

where that is meant to be a template for a list - PDFBox can't really know the difference.  You wouldn't want that text to be extracted as "1234..."

Someone else might have an idea for a solution here, but I don't see an obvious one.

