Posted to dev@tika.apache.org by "marek kapowicki (Jira)" <ji...@apache.org> on 2020/09/23 05:55:00 UTC

[jira] [Closed] (TIKA-3202) Tika duplicates the ocr text

     [ https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

marek kapowicki closed TIKA-3202.
---------------------------------
    Resolution: Works for Me

> Tika duplicates the ocr text
> ----------------------------
>
>                 Key: TIKA-3202
>                 URL: https://issues.apache.org/jira/browse/TIKA-3202
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.24.1
>            Reporter: marek kapowicki
>            Priority: Major
>         Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text 
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> The curl command to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf  http://localhost:9998/tika
> {code}
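
To see which part of the output comes from OCR, it may help to send the same request with different values of the X-Tika-PDFocrStrategy header and compare the results. The following is a minimal sketch, not part of the original report: the helper name tika_request and the DRY_RUN, TIKA_URL and PDF_FILE variables are hypothetical, and it assumes the server also accepts NO_OCR as a strategy value.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika): repeat the request from the report
# with a chosen OCR strategy, or just print the command when DRY_RUN=1.
TIKA_URL="${TIKA_URL:-http://localhost:9998/tika}"   # endpoint used in the report
PDF_FILE="${PDF_FILE:-text_and_image.pdf}"           # attachment named in the report

tika_request() {
    strategy="$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Print the curl command instead of contacting the server.
        printf 'curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: %s" -T %s %s\n' \
            "$strategy" "$PDF_FILE" "$TIKA_URL"
    else
        # Send the PDF to the server and write the extracted text to stdout.
        curl -s -H "X-Tika-PDFextractInlineImages: true" \
             -H "X-Tika-PDFocrStrategy: $strategy" \
             -T "$PDF_FILE" "$TIKA_URL"
    fi
}
```

With a server running, one could save `tika_request NO_OCR` and `tika_request OCR_AND_TEXT` to two files and diff them. If the repeated sentence appears only in the second file, it is presumably the OCR pass re-reading text that was already extracted from the PDF.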



--
This message was sent by Atlassian Jira
(v8.3.4#803005)