Posted to dev@tika.apache.org by "John Mastarone (Commented) (JIRA)" <ji...@apache.org> on 2011/11/25 03:58:40 UTC
[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

    [ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156968#comment-13156968 ] 

John Mastarone commented on TIKA-723:
-------------------------------------

With the latest source, I tried adding the line
"if (parser instanceof org.apache.tika.parser.pdf.PDFParser){ ((org.apache.tika.parser.pdf.PDFParser)parser).setSortByPosition(true);}"
to the parse method of the CompositeParser class, right after the line "Parser parser = getParser(metadata);". I also had to add tika-parsers as a dependency of tika-core. After rebuilding the core jar and tika-app, the text was no longer split vertically, one character per line, when using the GUI. None of the other PDFs in the test-resources folder appeared to be parsed incorrectly, except for the first one (testAnnotations.pdf), which fails to parse entirely. It also fails with an unmodified, most recent version of the Tika GUI, with the same NPE in both cases, so it does not seem to be caused by my change. I don't know whether there is a JIRA item for this yet. I also downloaded the PDFBox application jar and ran ExtractText with the -sort option, which handled the rotated text in your rotated.pdf file correctly.
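
For readability, here is a minimal, self-contained sketch of the check that the quoted line performs. The Parser, PDFParser and OtherParser types below are simplified stand-ins written for this illustration, not the real Tika classes; only the setSortByPosition(true) call and the instanceof test are taken from the line quoted above, and the getter is an assumption added so the effect can be observed.

```java
// Sketch only: stand-in types, not the real org.apache.tika classes.
class SortByPositionSketch {

    // Stand-in for org.apache.tika.parser.Parser (simplified, hypothetical).
    interface Parser {
    }

    // Stand-in for org.apache.tika.parser.pdf.PDFParser. Only the setter
    // named in the comment is modeled; the getter is an assumption.
    static class PDFParser implements Parser {
        private boolean sortByPosition = false;

        void setSortByPosition(boolean sortByPosition) {
            this.sortByPosition = sortByPosition;
        }

        boolean getSortByPosition() {
            return sortByPosition;
        }
    }

    // Stand-in for any non-PDF parser the composite might select.
    static class OtherParser implements Parser {
    }

    // Mirrors the added line: turn on position sorting only when the
    // selected delegate is the PDF parser, and leave other parsers alone.
    static Parser configure(Parser parser) {
        if (parser instanceof PDFParser) {
            ((PDFParser) parser).setSortByPosition(true);
        }
        return parser;
    }

    public static void main(String[] args) {
        PDFParser pdf = new PDFParser();
        configure(pdf);
        System.out.println("PDF parser sorts by position: " + pdf.getSortByPosition());
    }
}
```

One design note: a check like this ties the generic dispatch code to one concrete parser, which is why the dependency from the core to the parsers module became necessary.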

With this change to CompositeParser, two test cases failed in tika-parsers: the assertions at lines 147 and 180 of PDFParserTest.java, which concern testPDFTwoTextBoxes.pdf and a table in testPDFVarious.pdf.  However, the assertions on those lines are arguably open to interpretation: should the Tika PDF parser really print all of the items in a column before moving on to the next column?  The change I made results in all elements of a given row being printed before moving on to the next row (row-major order instead of column-major).  This could be fine for the table in testPDFVarious.pdf, but maybe less so for the two text boxes in the other PDF.
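
To make the row-major versus column-major difference concrete, here is a small standalone sketch. It does not use Tika or PDFBox; the Chunk type, the coordinates and the sample strings are all made up for illustration, and it assumes y grows downward from the top of the page. It also ignores any tolerance a real extractor would use to decide that two chunks sit on the same line.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Sketch only: models two side-by-side text boxes as positioned chunks.
class ReadingOrderSketch {

    // Hypothetical positioned piece of text (x = left edge, y = distance from page top).
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Chunks in the order they were written to the page: the whole left
    // box first, then the whole right box (column-major).
    static List<Chunk> twoTextBoxes() {
        return Arrays.asList(
                new Chunk("Left 1", 50, 100),
                new Chunk("Left 2", 50, 120),
                new Chunk("Right 1", 300, 100),
                new Chunk("Right 2", 300, 120));
    }

    // Without sorting: emit the text in the order the chunks were given.
    static List<String> streamOrder(List<Chunk> chunks) {
        List<String> out = new ArrayList<>();
        for (Chunk c : chunks) {
            out.add(c.text);
        }
        return out;
    }

    // With sorting: top to bottom, then left to right (row-major).
    static List<String> sortedByPosition(List<Chunk> chunks) {
        List<Chunk> copy = new ArrayList<>(chunks);
        copy.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return streamOrder(copy);
    }

    public static void main(String[] args) {
        List<Chunk> chunks = twoTextBoxes();
        System.out.println(streamOrder(chunks));      // [Left 1, Left 2, Right 1, Right 2]
        System.out.println(sortedByPosition(chunks)); // [Left 1, Right 1, Left 2, Right 2]
    }
}
```

For a table, the second ordering keeps each row together; for two independent text boxes, it interleaves their lines, which is the concern raised above.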

So, I'm not experienced with Tika development at all, but perhaps a line (or lines) like the one above should be somewhere in the code--if not in the CompositeParser, then elsewhere, depending on what you and/or others think about the test cases that would fail as a result.  
                
> Rotated text isn't extracted correctly from PDFs
> ------------------------------------------------
>
>                 Key: TIKA-723
>                 URL: https://issues.apache.org/jira/browse/TIKA-723
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: rotated.pdf
>
>
> I have an example PDF with 90 degree rotation; Tika produces the
> characters one per line.  I.e., the doc has "Some rotated text,
> here!" but Tika produces this:
> {noformat}
> <body><div class="page"><p>So
> m
> e
>  
> r
> o
> t
> a
> t
> e
> d
>  
> t
> e
> x
> t
> ,
>  
> h
> e
> r
> e
> !</p>
> {noformat}
> I'm able to copy/paste the text out correctly.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira