Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/09/24 12:42:31 UTC

RE: {EXTERNAL}Problem running OCR

Any thoughts on why I can't get OCR to work on this PDF?

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>


From: Peter Kronenberg <pe...@torch.ai>
Sent: Wednesday, September 22, 2021 9:33 PM
To: user@tika.apache.org
Cc: tallison@apache.org
Subject: {EXTERNAL}Problem running OCR

OK, this is one of those situations where I must be doing something stupid, but I can't get Tika to properly process the attached file.  It's an image-based PDF, and Tika isn't extracting any text from it, even when I run with OCRStrategy = OCR_ONLY.



It's definitely getting to the call to doOCROnCurrentPage(AUTO) in AbstractPDF2XHTML, so it's not a matter of the character counts preventing the OCR.



I don't think it has anything to do with the fact that the document is in German.  I tried setting the language to DEU, but got the same results.

What is going on?
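
For reference, the setup described above could be sketched roughly as follows. This is a minimal sketch, not the code behind the original report: the class and method names are hypothetical, it assumes Tika's PDF and OCR parser modules are on the classpath and that Tesseract is installed, and it passes the lowercase code "deu", which is the name Tesseract's German language data file uses.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, for illustration only.
class OcrSetupSketch {

    // Builds a ParseContext that asks the PDF parser to rely on OCR alone
    // and tells Tesseract which language data to use.
    static ParseContext buildContext(Parser parser, String tesseractLanguage) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        ocrConfig.setLanguage(tesseractLanguage);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets embedded content be handled by it too.
        context.set(Parser.class, parser);
        return context;
    }

    // Parses one PDF and returns whatever text Tika extracted from it.
    static String extractText(Path pdf) throws Exception {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        try (InputStream in = Files.newInputStream(pdf)) {
            parser.parse(in, handler, new Metadata(), buildContext(parser, "deu"));
        }
        return handler.toString();
    }
}
```

Calling OcrSetupSketch.extractText(path) on the problem PDF and printing the result would show whether any OCR text comes back under this configuration.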
