Posted to dev@tika.apache.org by "Tamara (JIRA)" <ji...@apache.org> on 2015/01/28 09:11:34 UTC

[jira] [Created] (TIKA-1533) PDF parse failing to capture right order of text (2 columns)

Tamara created TIKA-1533:
----------------------------

             Summary: PDF parse failing to capture right order of text (2 columns)
                 Key: TIKA-1533
                 URL: https://issues.apache.org/jira/browse/TIKA-1533
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.7, 1.6
         Environment: Java 8, Mac OS X
            Reporter: Tamara


When I convert a document with two columns, the order of the columns is inverted in the text file. I only noticed because the content is an index list. In the first file the problem starts on page 303; to find it in the converted text, search for 362. The second file has the same problem on page 341.

I have tried setSortByPosition(true), but then the columns get scrambled.
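
For illustration, here is a minimal sketch of one plausible reason a pure position sort scrambles two-column text: ordering fragments top to bottom and then left to right interleaves the lines of the two columns. This is not Tika or PDFBox code. The Fragment class, the coordinates (y is assumed to grow downward), and the splitX column boundary are all hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ColumnOrderSketch {

    // Hypothetical stand-in for a piece of text with a position on the page.
    static final class Fragment {
        final String text;
        final int x;
        final int y;

        Fragment(String text, int x, int y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Pure position sort: top to bottom first, then left to right.
    // Lines that share a y value end up next to each other, so columns interleave.
    static List<String> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.y)
                .thenComparingInt(f -> f.x));
        return texts(sorted);
    }

    // Column-aware order: everything left of splitX first, then the rest,
    // each column read top to bottom. splitX is an assumed, known boundary.
    static List<String> sortByColumn(List<Fragment> fragments, int splitX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < splitX ? 0 : 1)
                .thenComparingInt(f -> f.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    // Made-up page: column A at x=50, column B at x=300, two lines each.
    static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("A1", 50, 100));
        page.add(new Fragment("A2", 50, 120));
        page.add(new Fragment("B1", 300, 100));
        page.add(new Fragment("B2", 300, 120));
        return page;
    }

    public static void main(String[] args) {
        System.out.println(sortByPosition(samplePage()));    // [A1, B1, A2, B2]
        System.out.println(sortByColumn(samplePage(), 200)); // [A1, A2, B1, B2]
    }
}
```

On the made-up page, the position sort yields A1, B1, A2, B2, while the column-aware order yields A1, A2, B1, B2, which is the reading order expected for an index list. Whether this is what actually happens inside the parser for the files below is an assumption.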

I have also tried copying and pasting from the PDF preview, and the pasted text is in the correct order.

I have also tried PDFXStream, and it parses the text in the correct order.

Here are the files in which I have seen the issue:
http://www.sbu.se/upload/Publikationer/Content0/1/Autismspektrumtillst%C3%A5nd_fulltext.pdf

http://www.sbu.se/upload/publikationer/content0/1/forstamningssyndrom_fulltext.pdf





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)