
Posted to dev@tika.apache.org by "Michael McCandless (JIRA)" <ji...@apache.org> on 2011/09/19 19:45:09 UTC

[jira] [Created] (TIKA-723) Rotated text isn't extracted correctly from PDFs

Rotated text isn't extracted correctly from PDFs
------------------------------------------------

                 Key: TIKA-723
                 URL: https://issues.apache.org/jira/browse/TIKA-723
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor
         Attachments: rotated.pdf

I have an example PDF with text rotated 90 degrees; Tika emits the
characters one per line.  I.e., the doc contains "Some rotated text,
here!" but Tika produces this:

{noformat}
<body><div class="page"><p>So
m
e
 
r
o
t
a
t
e
d
 
t
e
x
t
,
 
h
e
r
e
!</p>
{noformat}

I'm able to copy/paste the text out correctly.


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (TIKA-723) Rotated text isn't extracted correctly from PDFs

Posted by "Michael McCandless (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless updated TIKA-723:
------------------------------------

    Attachment: rotated.pdf


[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

Posted by "Nick Burch (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157581#comment-13157581 ] 

Nick Burch commented on TIKA-723:
---------------------------------

I think the idea is to offer these sorts of PDFBox options via the parser context, so you can toggle them on or off as you desire. See TIKA-612 for details on the progress with that.
                

[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

Posted by "John Mastarone (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156968#comment-13156968 ] 

John Mastarone commented on TIKA-723:
-------------------------------------

With the latest source, I tried adding the line
"if (parser instanceof org.apache.tika.parser.pdf.PDFParser){ ((org.apache.tika.parser.pdf.PDFParser)parser).setSortByPosition(true);}"
to the parse method of the CompositeParser class, right after the line "Parser parser = getParser(metadata);". I also had to add tika-parsers as a dependency of the core. After rebuilding the core jar and tika-app, the text in rotated.pdf was no longer split vertically, one character per line, when viewed in the GUI.

None of the other PDFs in the test-resources folder appeared to be parsed incorrectly, except for the first one (testAnnotations.pdf), which fails to parse entirely. It also fails with an unmodified, most recent version of the Tika GUI, with the same NPE in both cases. I don't know whether there's a JIRA item for this yet.

Also, I downloaded the PDFBox application jar and ran ExtractText with the -sort option, and this extracted the rotated text in your rotated.pdf file correctly.

After I made that change to CompositeParser, two test cases in tika-parsers failed, at lines 147 and 180 of PDFParserTest.java, which concern testPDFTwoTextBoxes.pdf and a table in testPDFVarious.pdf.  However, the assertions on these lines are arguably open to interpretation: should the Tika PDF parser really print all of the items in a column before moving on to the next column?  With my change, all elements of a given row are printed before moving on to the next row (row-major order instead of column-major).  This could be fine for the table in testPDFVarious.pdf, but maybe less so for the two text boxes in the other PDF?
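
To make the row-major versus column-major difference concrete, here is a minimal, self-contained sketch. It is not Tika or PDFBox code: the Chunk class, the coordinates, and the simple "sort by y, then by x" rule are all assumptions made for illustration, and PDFBox's real position sort is more involved. The sketch also assumes the PDF's content stream lists each column in full before the next one, which is what the existing test assertions appear to expect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustration only: a hypothetical model of positioned text, not a Tika/PDFBox type. */
final class ReadingOrderSketch {

    /** A piece of text with the top-left corner of its box; y grows downward. */
    static final class Chunk {
        final String text;
        final double x;
        final double y;

        Chunk(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /** A 2x2 table whose (assumed) stream order lists column A fully, then column B. */
    static List<Chunk> twoColumnTable() {
        return Arrays.asList(
                new Chunk("a1", 0, 0),
                new Chunk("a2", 0, 10),
                new Chunk("b1", 100, 0),
                new Chunk("b2", 100, 10));
    }

    /** Joins chunks in the order given, like extraction without position sorting. */
    static String inStreamOrder(List<Chunk> chunks) {
        StringBuilder sb = new StringBuilder();
        for (Chunk c : chunks) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }

    /** Sorts top-to-bottom, then left-to-right, before joining (simplified position sort). */
    static String sortedByPosition(List<Chunk> chunks) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y)
                .thenComparingDouble(c -> c.x));
        return inStreamOrder(sorted);
    }

    public static void main(String[] args) {
        List<Chunk> table = twoColumnTable();
        System.out.println(inStreamOrder(table));    // prints "a1 a2 b1 b2" (column-major)
        System.out.println(sortedByPosition(table)); // prints "a1 b1 a2 b2" (row-major)
    }
}
```

For a table, the row-major result is arguably the more natural reading order. For two independent text boxes placed side by side, the same sort would interleave their lines, which is the concern raised above.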

So, I'm not experienced with Tika development at all, but perhaps a line (or lines) like the one above should be somewhere in the code--if not in the CompositeParser, then elsewhere, depending on what you and/or others think about the test cases that would fail as a result.  
                

[jira] [Commented] (TIKA-723) Rotated text isn't extracted correctly from PDFs

Posted by "Michael McCandless (Commented) (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/TIKA-723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157602#comment-13157602 ] 

Michael McCandless commented on TIKA-723:
-----------------------------------------

The sortByPosition option is tricky to default "properly", since the right choice depends very much on whether you are using the resulting text/xhtml to 1) feed a search engine (in which case, at least for 2-column PDFs, you don't want to sort by position), or 2) render something a user will look at directly (in which case I think you do want to sort by position, for better "fidelity" with how the document looks in a PDF viewer).

The default has flipped back and forth recently... and is currently off, but with TIKA-612 you can now set it directly on your PDFParser instance.
                