Posted to dev@tika.apache.org by "marek kapowicki (Jira)" <ji...@apache.org> on 2020/09/22 21:28:00 UTC

[jira] [Created] (TIKA-3202) Tika duplicates the ocr text

marek kapowicki created TIKA-3202:
-------------------------------------

             Summary: Tika duplicates the ocr text
                 Key: TIKA-3202
                 URL: https://issues.apache.org/jira/browse/TIKA-3202
             Project: Tika
          Issue Type: Bug
    Affects Versions: 1.24.1
            Reporter: marek kapowicki
         Attachments: text_and_image.pdf

I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.

The header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue.

The output from PDF processing is duplicated.
The output from the attached PDF file is:
{code:java}
There is some text 
[image: image0.jpg]

There is some textT
here is an image!!
{code}

The curl command to reproduce:


{code:java}
curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf  http://localhost:9998/tika
{code}
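
To check a response for the duplication reported above, a minimal Python sketch could count how often the PDF's embedded text appears in the extracted output. It assumes the embedded text is expected to appear only once. The helper name count_occurrences and the variable reported_output are hypothetical, and the sample string is copied from the output in this report rather than fetched from a server.

```python
def count_occurrences(text: str, phrase: str) -> int:
    """Count non-overlapping occurrences of phrase in text."""
    return text.count(phrase)


# Output copied from this report (not fetched from a running Tika server).
reported_output = (
    "There is some text \n"
    "[image: image0.jpg]\n"
    "\n"
    "There is some textT\n"
    "here is an image!!\n"
)

# Assumption: the embedded text should appear once; a higher count
# indicates the duplication described in this issue.
duplicates = count_occurrences(reported_output, "There is some text")
print(duplicates)  # prints 2 for the reported output
```

The same check could be applied to the body returned by the curl command, for example after saving it to a file.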



--
This message was sent by Atlassian Jira
(v8.3.4#803005)