Posted to dev@pdfbox.apache.org by "Dominik Bauer (JIRA)" <ji...@apache.org> on 2017/02/08 09:33:42 UTC

[jira] [Created] (PDFBOX-3680) Extracted text in wrong order [header, footer, content]

Dominik Bauer created PDFBOX-3680:
-------------------------------------

             Summary: Extracted text in wrong order [header, footer, content]
                 Key: PDFBOX-3680
                 URL: https://issues.apache.org/jira/browse/PDFBOX-3680
             Project: PDFBox
          Issue Type: Bug
          Components: Text extraction
    Affects Versions: 2.0.1
            Reporter: Dominik Bauer
         Attachments: 1_to_3_Text.txt, DSG 2000, Fassung vom 27.01.2017.pdf

When I extract the text from the attached PDF, it comes out in the wrong order.

Every page has a header ("Bundesrecht konsolidiert"), some content, and a footer ("www.ris.bka.gv.at Seite x von y"). The footer consists of a URL and the page number in German ("Seite x von y" means "page x of y").

In my view, the extracted text should follow the order in which a reader sees it on the page: header, content, footer.
When I open the file in Adobe Reader and copy the text from a page, the copied text is in that header, content, footer order as well.

The extracted text is:
{quote}
 Bundesrecht konsolidiert 
www.ris.bka.gv.at Seite 1 von 35 
Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
{quote}

Judging by the page as it is displayed, the extracted text should be:
{quote}
 Bundesrecht konsolidiert 
Gesamte Rechtsvorschrift [...] und Rechtsnachfolge
www.ris.bka.gv.at Seite 1 von 35 
{quote}
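
To illustrate the difference between the two orders, here is a minimal, self-contained sketch. It rests on an assumption that this report does not confirm: that the page's content stream draws the header and footer before the body, and that the extractor emits text in that drawing order. The `Fragment` class, the `ReadingOrderSketch` class, and all coordinates are hypothetical and are not PDFBox types or values taken from the attached file. Sorting the fragments top-to-bottom, then left-to-right, yields the expected reading order. PDFBox's `PDFTextStripper` has a `setSortByPosition(boolean)` setter that may be worth trying for the same purpose; whether it fixes this particular document has not been checked here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; none of these types come from PDFBox.
class ReadingOrderSketch {

    // A piece of text together with a made-up position on the page.
    static final class Fragment {
        final String text;
        final double x; // distance from the left page edge
        final double y; // distance from the TOP page edge (smaller = higher up)

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Assumed drawing order inside the page: header, footer, then body.
    // The coordinates are invented for the example.
    static List<Fragment> sampleStreamOrder() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("Bundesrecht konsolidiert", 70, 40));
        fragments.add(new Fragment("www.ris.bka.gv.at Seite 1 von 35", 70, 800));
        fragments.add(new Fragment("Gesamte Rechtsvorschrift [...] und Rechtsnachfolge", 70, 120));
        return fragments;
    }

    // Returns a copy sorted top-to-bottom, then left-to-right.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator
                .comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted;
    }

    // Joins the fragment texts, one per line.
    static String join(List<Fragment> fragments) {
        StringBuilder sb = new StringBuilder();
        for (Fragment f : fragments) {
            sb.append(f.text).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Fragment> streamOrder = sampleStreamOrder();
        System.out.println("Assumed stream order:");
        System.out.print(join(streamOrder));
        System.out.println("Sorted by position:");
        System.out.print(join(sortByPosition(streamOrder)));
    }
}
```

The sketch copies the list before sorting, so the assumed stream order stays available for comparison with the sorted result.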

The PDF itself and the extracted text of the first three pages are attached to this ticket.


