Posted to dev@tika.apache.org by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org> on 2011/12/01 12:52:40 UTC

[jira] [Commented] (TIKA-796) Tika breaks words of rotated text in PDF documents

    [ https://issues.apache.org/jira/browse/TIKA-796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160841#comment-13160841 ] 

Michael McCandless commented on TIKA-796:
-----------------------------------------

This looks like a dup of TIKA-723?

Note that with Tika 1.1 (not yet released) you can call PDFParser.setSortByPosition(true) and the rotated text should be extracted correctly (I just confirmed on this PDF).

However, that will also cause, e.g., two columns to become "interleaved", which is usually not what you want if the text is going to be indexed into a search index.
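
To illustrate the interleaving, here is a minimal toy sketch in plain Java. It does not use Tika or PDFBox and is not how PDFBox actually orders text; the Chunk class, the coordinates, and the helper names are all hypothetical. It only models the general idea: a two-column page whose text arrives column by column reads correctly in stream order, but sorting the same chunks top-to-bottom, then left-to-right, mixes the columns line by line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SortByPositionSketch {

    /** Hypothetical positioned piece of text; not a Tika or PDFBox type. */
    static final class Chunk {
        final String text;
        final int x; // horizontal position, larger means further right
        final int y; // vertical position, larger means further down the page

        Chunk(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** Joins the chunk texts in the order given (models sorting switched off). */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Orders a copy of the chunks by y, then x, and joins them (models sorting switched on). */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt((Chunk c) -> c.y).thenComparingInt(c -> c.x));
        return inStreamOrder(sorted);
    }

    /** A made-up two-column page whose content stream lists the left column first. */
    static List<Chunk> twoColumnPage() {
        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("left-1", 0, 0));
        page.add(new Chunk("left-2", 0, 1));
        page.add(new Chunk("right-1", 300, 0));
        page.add(new Chunk("right-2", 300, 1));
        return page;
    }

    public static void main(String[] args) {
        List<Chunk> page = twoColumnPage();
        System.out.println(inStreamOrder(page));    // left-1 left-2 right-1 right-2
        System.out.println(sortedByPosition(page)); // left-1 right-1 left-2 right-2
    }
}
```

In the second line of output each row of the page is read straight across both columns, which is the interleaving described above.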

I would love to fix PDFBox somehow to dynamically pick the right setting for the right chunk of text; often the rotated text arrives in the PDF as a single chunk of text and we could in theory extract it correctly even when setSortByPosition is false...
                
> Tika breaks words of rotated text in PDF documents
> --------------------------------------------------
>
>                 Key: TIKA-796
>                 URL: https://issues.apache.org/jira/browse/TIKA-796
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.10, 1.0
>         Environment: Windows 7 Professional x64, Java(TM) SE Runtime Environment (build 1.6.0_25-b06), Java HotSpot(TM) 64-Bit Server VM (build 20.0-b11, mixed mode)
>            Reporter: Franz Canaval
>              Labels: broken, linefeed, pdf, rotated, text, words
>
> When Tika extracts text from a PDF file, *rotated text is extracted in a way that breaks words apart.* The number of lines in a rotated paragraph appears to equal the number of characters after which Tika breaks the words apart with a line feed (0x0a) character.
> Steps to reproduce this issue (in this example, on a Windows machine):
> * Download the following pdf file: [http://www.verbraucherzentrale-rlp.de/mediabig/115471A.pdf], e.g. to C:\temp\
> * Open a console window and run tika with: {{java -jar tika-app.jar -t "file:///c:/temp/energieberatung.pdf" > test.txt}}
> * Have a look at the text file, e.g. with a hex editor, and note the words broken into 2-character pieces: {{<char1><char2><LF>}}
> *This problem seems to have been introduced with Tika 0.10; it still exists in Tika 1.0.*