Posted to dev@tika.apache.org by "Tamara (JIRA)" <ji...@apache.org> on 2015/01/28 09:11:34 UTC
[jira] [Created] (TIKA-1533) PDF parse failing to capture right
order of text (2 columns)
Tamara created TIKA-1533:
----------------------------
Summary: PDF parse failing to capture right order of text (2 columns)
Key: TIKA-1533
URL: https://issues.apache.org/jira/browse/TIKA-1533
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.7, 1.6
Environment: Java 8, Mac OS X
Reporter: Tamara
When I convert a document with two columns, the order of the columns is inverted in the text output. I only noticed because the content is an index list. In the first file the problem starts on page 303; to find it in the converted text, search for 362. The second file has the same problem on page 341.
I tried setSortByPosition(true), and the columns got scrambled.
I tried copying and pasting from the PDF preview, and the pasted text is in the correct order.
I also tried PDFXStream, and it parses the text in the right order.
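
To illustrate why sorting by position may scramble a two-column page, here is a minimal, self-contained sketch. It does not use Tika or PDFBox: the Fragment class, the coordinates, and the column boundary are all hypothetical, and it assumes that a position sort orders text top-to-bottom and then left-to-right, which is one plausible explanation for the scrambling rather than a confirmed description of the parser's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ColumnOrderSketch {

    // Hypothetical text fragment (not a Tika or PDFBox type).
    // y is assumed to grow downward from the top of the page.
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Naive position sort: top-to-bottom, then left-to-right.
    // On a two-column page this interleaves the columns row by row.
    static List<String> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return texts(sorted);
    }

    // Column-aware sort: everything left of columnBoundaryX comes first,
    // and each column is read top-to-bottom.
    static List<String> sortByColumn(List<Fragment> fragments, double columnBoundaryX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparing((Fragment f) -> f.x >= columnBoundaryX) // false (left) sorts before true (right)
                .thenComparingDouble(f -> f.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Fragment> fragments) {
        return fragments.stream().map(f -> f.text).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Fragments arrive with the right column first, as in the reported inversion.
        List<Fragment> page = Arrays.asList(
                new Fragment("B 1", 300, 100),
                new Fragment("B 2", 300, 120),
                new Fragment("A 1", 50, 100),
                new Fragment("A 2", 50, 120));

        System.out.println(sortByPosition(page));      // [A 1, B 1, A 2, B 2]
        System.out.println(sortByColumn(page, 200.0)); // [A 1, A 2, B 1, B 2]
    }
}
```

In this toy layout the plain position sort mixes entries from both columns, while grouping by column first restores the reading order. A real page would need the column boundary to be detected rather than passed in.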
Here are the files in which I have seen the issue:
http://www.sbu.se/upload/Publikationer/Content0/1/Autismspektrumtillst%C3%A5nd_fulltext.pdf
http://www.sbu.se/upload/publikationer/content0/1/forstamningssyndrom_fulltext.pdf
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)