Posted to dev@tika.apache.org by "marek kapowicki (Jira)" <ji...@apache.org> on 2020/09/23 05:55:00 UTC
[jira] [Closed] (TIKA-3202) Tika duplicates the ocr text
[ https://issues.apache.org/jira/browse/TIKA-3202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
marek kapowicki closed TIKA-3202.
---------------------------------
Resolution: Works for Me
> Tika duplicates the ocr text
> ----------------------------
>
> Key: TIKA-3202
> URL: https://issues.apache.org/jira/browse/TIKA-3202
> Project: Tika
> Issue Type: Bug
> Affects Versions: 1.24.1
> Reporter: marek kapowicki
> Priority: Major
> Attachments: text_and_image.pdf
>
>
> I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
> The header X-Tika-PDFocrStrategy: OCR_AND_TEXT causes the issue:
> the output from PDF processing is duplicated.
> The output from the attached PDF file is:
> {code:java}
> There is some text
> [image: image0.jpg]
> There is some textT
> here is an image!!
> {code}
> The curl command to reproduce:
> {code:java}
> curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)