Posted to dev@tika.apache.org by "marek kapowicki (Jira)" <ji...@apache.org> on 2020/09/22 21:28:00 UTC
[jira] [Created] (TIKA-3202) Tika duplicates the ocr text
marek kapowicki created TIKA-3202:
-------------------------------------
Summary: Tika duplicates the ocr text
Key: TIKA-3202
URL: https://issues.apache.org/jira/browse/TIKA-3202
Project: Tika
Issue Type: Bug
Affects Versions: 1.24.1
Reporter: marek kapowicki
Attachments: text_and_image.pdf
I'm using Tika 1.24.1 together with Tesseract from the Docker image apache/tika:1.24-full.
The header X-Tika-PDFocrStrategy: OCR_AND_TEXT triggers the issue:
the text extracted from the PDF is duplicated.
The output from the attached PDF file is:
{code:java}
There is some text
[image: image0.jpg]
There is some textT
here is an image!!
{code}
The curl command to reproduce:
{code:java}
curl -H "X-Tika-PDFextractInlineImages: true" -H "X-Tika-PDFocrStrategy: OCR_AND_TEXT" -T text_and_image.pdf http://localhost:9998/tika
{code}
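The reproduction can also be scripted. The following is a minimal Python sketch, not part of the original report. It assumes the Tika server from the curl command is listening on localhost:9998. The helper names build_request, fetch_text and count_phrase are hypothetical. It sends Accept: */* as curl does by default, and it sets Content-Type: application/pdf explicitly (an assumption, since curl -T sends no content type) because urllib would otherwise add a form-encoded one.

```python
import urllib.request

# Endpoint taken from the curl command in the report.
TIKA_URL = "http://localhost:9998/tika"


def build_request(pdf_bytes, url=TIKA_URL):
    """Build a PUT request carrying the two X-Tika headers from the report."""
    headers = {
        "X-Tika-PDFextractInlineImages": "true",
        "X-Tika-PDFocrStrategy": "OCR_AND_TEXT",
        # curl sends this Accept header by default.
        "Accept": "*/*",
        # Assumption: set explicitly so urllib does not add its
        # default application/x-www-form-urlencoded content type.
        "Content-Type": "application/pdf",
    }
    return urllib.request.Request(
        url, data=pdf_bytes, headers=headers, method="PUT"
    )


def fetch_text(pdf_path, url=TIKA_URL):
    """Upload the PDF to a running Tika server and return the response text."""
    with open(pdf_path, "rb") as fh:
        req = build_request(fh.read(), url)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def count_phrase(text, phrase):
    """Count non-overlapping occurrences of phrase in the extracted text."""
    return text.count(phrase)
```

With the server running, count_phrase(fetch_text("text_and_image.pdf"), "There is some text") would be expected to return 1 if the text were extracted only once. Applied to the output quoted above, it returns 2.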