Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/09/23 01:32:55 UTC

Problem running OCR

OK, this is one of those situations where I must be doing something stupid, but I can't get Tika to properly process the attached file.  It's an image-based PDF, and Tika just isn't getting any text out of it, even if I run with OCRStrategy = OCR_ONLY.

It's definitely getting to the call to doOCROnCurrentPage(AUTO) in AbstractPDF2XHTML, so it's not a matter of the character counts preventing the OCR.

I don't think it has anything to do with the fact that it is in German.  I tried setting the language to DEU, but got the same results.

What is going on?

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>

RE: Problem running OCR

Posted by Peter Kronenberg <pe...@torch.ai>.

Any thoughts on why I can't get OCR to work on this PDF?


Re: Problem running OCR

Posted by Tim Allison <ta...@apache.org>.

I even broke out my Windows laptop, and the basic command line with tika-app
works there, too, in 2.0.0 and 2.1.0.
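
When comparing a working command-line run with an embedded one, it can help to rule out a PATH difference for the tesseract executable. The following is a minimal sketch using only the JDK; the class name ExecutableLocator and the method findOnPath are hypothetical helpers, not part of Tika, and the check only looks for a regular file named tesseract (or tesseract.exe) in each PATH entry.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper: looks for an executable by name in a PATH-style string.
final class ExecutableLocator {

    static Optional<Path> findOnPath(String baseName, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return Optional.empty();
        }
        // Try the bare name first, then the Windows-style ".exe" variant.
        String[] names = {baseName, baseName + ".exe"};
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                try {
                    Path candidate = Paths.get(dir, name);
                    if (Files.isRegularFile(candidate)) {
                        return Optional.of(candidate);
                    }
                } catch (InvalidPathException e) {
                    // Skip PATH entries that are not valid paths on this platform.
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Uses the PATH seen by this Java process, which may differ from the shell's.
        System.out.println(findOnPath("tesseract", System.getenv("PATH"))
                .map(Path::toString)
                .orElse("tesseract not found on PATH"));
    }
}
```

Running this from the same environment as the failing application (for example, from the IDE run configuration) shows whether that process can see the executable at all.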

On Fri, Sep 24, 2021 at 10:31 AM Tim Allison <ta...@apache.org> wrote:

> If you turn off all the configurations, does it work for you?
>
> On Fri, Sep 24, 2021 at 10:21 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
>> I was afraid it would work for you 😊
>>
>>
>>
>> *From:* Tim Allison <ta...@apache.org>
>> *Sent:* Friday, September 24, 2021 10:09 AM
>> *To:* Peter Kronenberg <pe...@torch.ai>
>> *Cc:* user@tika.apache.org
>> *Subject:* Re: Problem running OCR
>>
>>
>>
>> I'm having luck with 2.1.0's app.  How are you calling Tika?  What
>> configurations do you have?  Is tesseract on your command line, etc?
>>
>>
>>
>> java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf
>>
>> INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser
>> Tesseract is installed and is being invoked. This can add greatly to
>> processing time.  If you do not want tesseract to be applied to your
>> files see:
>> https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
>>
>> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
>>
>> <head>
>>
>> <meta name="pdf:PDFVersion" content="1.7"/>
>>
>> <meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="pdf:hasXFA" content="false"/>
>>
>> <meta name="access_permission:modify_annotations" content="true"/>
>>
>> <meta name="access_permission:can_print_degraded" content="true"/>
>>
>> <meta name="dc:creator" content="Michele Stutz"/>
>>
>> <meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>
>>
>> <meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>
>>
>> <meta name="dc:format" content="application/pdf; version=1.7"/>
>>
>> <meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>
>>
>> <meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="access_permission:fill_in_form" content="true"/>
>>
>> <meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>
>>
>> <meta name="pdf:encrypted" content="false"/>
>>
>> <meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>
>>
>> <meta name="Content-Length" content="38927"/>
>>
>> <meta name="pdf:hasMarkedContent" content="true"/>
>>
>> <meta name="Content-Type" content="application/pdf"/>
>>
>> <meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>
>>
>> <meta name="pdf:docinfo:creator" content="Michele Stutz"/>
>>
>> <meta name="dc:language" content="en-US"/>
>>
>> <meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="access_permission:extract_for_accessibility" content="true"/>
>>
>> <meta name="access_permission:assemble_document" content="true"/>
>>
>> <meta name="xmpTPg:NPages" content="1"/>
>>
>> <meta name="resourceName" content="sample german image.pdf"/>
>>
>> <meta name="pdf:hasXMP" content="true"/>
>>
>> <meta name="access_permission:extract_content" content="true"/>
>>
>> <meta name="access_permission:can_print" content="true"/>
>>
>> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>
>>
>> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
>>
>> <meta name="access_permission:can_modify" content="true"/>
>>
>> <meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>
>>
>> <meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>
>>
>> <title/>
>>
>> </head>
>>
>> <body><div class="page"><p/>
>>
>> <p> </p>
>>
>> <p/>
>>
>> <div class="ocr">Armin Laschet will an die Spitze und kampft
>>
>>
>>
>> Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck.
>>
>> Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun
>>
>> kampferisch und warnt vor einem Linksruck.
>>
>> </div>
>>
>>
>>
>> </div>
>>
>>
>>

RE: Problem running OCR

Posted by Peter Kronenberg <pe...@torch.ai>.

So the option that was throwing me off was extractInlineImages.  Has something changed recently?  My code hasn’t changed and I’m not sure how I wouldn’t have noticed this before.
If extractInlineImages=false, does that mean that OCR won’t work at all for the PDF?  Even if it is a non-searchable PDF where each page is a scanned image?

And I can’t figure out why it’s set to FALSE. In tika-config.xml, I have
<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <param name="extractInlineImages" type="bool">true</param>
    </params>
</parser>


In my code, I have

TikaConfig tikaConfig;
try (InputStream is = TikaOCRParser.class.getClassLoader().getResourceAsStream("tika-config.xml")) {
    tikaConfig = new TikaConfig(is);
}

final PDFParserConfig pdfConfig = new PDFParserConfig();
final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
final AutoDetectParser parser = new AutoDetectParser(tikaConfig);
final ParseContext parseContext = new ParseContext();

parseContext.set(AutoDetectParser.class, parser);
parseContext.set(PDFParserConfig.class, pdfConfig);
parseContext.set(TesseractOCRConfig.class, tessConfig);


I know I probably talked to you about this at the time, and thought I had it right.  Is it correct that I’m passing the tikaConfig to the AutoDetectParser()?
When I print the value of isExtractInlineImages right after instantiating PDFParserConfig, it comes up as FALSE.  What is the Tika default for this?
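
One way to narrow this down is to confirm what the tika-config.xml on the classpath actually declares, independently of Tika. The following is a minimal sketch that uses only the JDK's XML parser (no Tika API) to list the param values under a given parser class. The class name TikaConfigParams and the method paramsFor are hypothetical helpers. It only reports what the XML file says; it does not reflect a PDFParserConfig that is created separately in code and placed in the ParseContext.

```java
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper: reads <param> entries for one <parser class="..."> element.
final class TikaConfigParams {

    static Map<String, String> paramsFor(InputStream xml, String parserClass) throws Exception {
        if (xml == null) {
            // getResourceAsStream returns null when the resource is not on the classpath.
            throw new IllegalArgumentException("config stream is null; is the file on the classpath?");
        }
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        Map<String, String> result = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList params = parser.getElementsByTagName("param");
            for (int j = 0; j < params.getLength(); j++) {
                Element param = (Element) params.item(j);
                result.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Same lookup as in the message above: the config is loaded from the classpath.
        try (InputStream is = TikaConfigParams.class.getClassLoader()
                .getResourceAsStream("tika-config.xml")) {
            System.out.println(paramsFor(is, "org.apache.tika.parser.pdf.PDFParser"));
        }
    }
}
```

If this prints extractInlineImages=true while the parse still behaves as if it were false, the difference is more likely to come from a value set in code than from the XML file.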



RE: Problem running OCR

Posted by Peter Kronenberg <pe...@torch.ai>.

Duh, thanks!  It worked.  Now I have to figure out which config option was messing it up.


Re: Problem running OCR

Posted by Tim Allison <ta...@apache.org>.

If you turn off all the configurations, does it work for you?


RE: Problem running OCR

Posted by Peter Kronenberg <pe...@torch.ai>.

I was afraid it would work for you 😊

Here’s the code I’m using, followed by my output:

package org.torchai;

import lombok.extern.slf4j.Slf4j;
import org.apache.commons.lang3.time.StopWatch;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.langdetect.optimaize.OptimaizeLangDetector;
import org.apache.tika.language.detect.LanguageDetector;
import org.apache.tika.language.detect.LanguageHandler;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToXMLContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Paths;
import java.util.concurrent.TimeUnit;

@Slf4j
public class TikaOCRParser {

    public static String parse(String file) throws TikaException, SAXException, IOException {
        TikaConfig tikaConfig;
        try (InputStream is = TikaOCRParser.class.getClassLoader().getResourceAsStream("tika-config.xml")) {
            tikaConfig = new TikaConfig(is);
        }

        final PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.AUTO);

        final TesseractOCRConfig tessConfig = new TesseractOCRConfig();

        final AutoDetectParser parser = new AutoDetectParser(tikaConfig);

        final ParseContext parseContext = new ParseContext();

        parseContext.set(AutoDetectParser.class, parser);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        parseContext.set(TesseractOCRConfig.class, tessConfig);

        try {
            tessConfig.setLanguage("deu");
        } catch (Exception e) {
            System.out.println("Error setting language - " + e.getMessage());
            throw e;
        }
        tessConfig.setEnableImagePreprocessing(true);
        tessConfig.setApplyRotation(true);
        tessConfig.setResize(200);
        log.info("enableImageProcessing: " + tessConfig.isEnableImagePreprocessing());
        log.info("apply rotation: " + tessConfig.isApplyRotation());
        log.info("resize: " + tessConfig.getResize());

        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
        pdfConfig.setExtractInlineImages(false);
        log.info("PDF Extract inline images: " + pdfConfig.isExtractInlineImages());
        log.info("PDF OCR Strategy: " + pdfConfig.getOcrStrategy());
        log.info("PDF OCR DPI: " + pdfConfig.getOcrDPI());

        final LanguageHandler languageHandler = new LanguageHandler();
        ContentHandler bodyContentHandler = new BodyContentHandler();
        ContentHandler xmlHandler = new ToXMLContentHandler();
        ContentHandler tee = new TeeContentHandler(languageHandler, bodyContentHandler, xmlHandler);
        Metadata metadata = new Metadata();

        try (TikaInputStream stream = TikaInputStream.get(Paths.get(file))) {
            log.info("calling parse on " + file);

            parser.parse(stream, tee, metadata, parseContext);
        } catch (Exception e) {
            e.printStackTrace();
        }

        String str = xmlHandler.toString();

        log.info("language: " + languageHandler.getLanguage());
        identifyLanguage(str);
        return str;
    }

    public static void identifyLanguage(String text) throws IOException {
        final LanguageDetector detector = new OptimaizeLangDetector();
        detector.loadModels();

        log.info("Language: " + detector.detectAll(text));
    }

    public static void main(String[] args) throws TikaException, SAXException, IOException {
        final StopWatch watch = new StopWatch();
        watch.start();

        String file = "c:\\testFiles\\sample german image.pdf";
        // (the rest of main is cut off in the original message)


[main] INFO org.torchai.TikaOCRParser - enableImageProcessing: true
[main] INFO org.torchai.TikaOCRParser - apply rotation: true
[main] INFO org.torchai.TikaOCRParser - resize: 200
[main] INFO org.torchai.TikaOCRParser - PDF Extract inline images: false
[main] INFO org.torchai.TikaOCRParser - PDF OCR Strategy: OCR_ONLY
[main] INFO org.torchai.TikaOCRParser - PDF OCR DPI: 300
[main] INFO org.torchai.TikaOCRParser - calling parse on c:\testFiles\sample german image.pdf
[main] INFO org.apache.tika.parser.ocr.TesseractOCRParser - Tesseract is installed and is being invoked. This can add greatly to processing time.  If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr
[main] INFO org.torchai.TikaOCRParser - language: : NONE (0.000000)
[main] INFO org.torchai.TikaOCRParser - Language: [it: MEDIUM (0.857139), en: MEDIUM (0.142860)]
[main] INFO org.torchai.TikaOCRParser - Elapsed time: 57309ms
Text: <html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Michele Stutz" />
<meta name="dcterms:created" content="2021-09-22T20:14:08Z" />
<meta name="dcterms:modified" content="2021-09-22T20:14:08Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z" />
<meta name="pdf:docinfo:creator" content="Michele Stutz" />
<meta name="dc:language" content="en-US" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="1" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.CompositeParser" />
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z" />
<title></title>
</head>
<body><div class="page"><div class="ocr" />

</div>
</body></html>

Process finished with exit code 0








From: Tim Allison <ta...@apache.org>
Sent: Friday, September 24, 2021 10:09 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subj
I'm having luck with 2.1.0's app.  How are you calling Tika?  What configurations do you have?  Is tesseract on your command line, etc?


java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf

INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser Tesseract is installed and is being invoked. This can add greatly to processing time.  If you do not want tesseract to be applied to your files see: https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta name="pdf:PDFVersion" content="1.7"/>

<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:hasXFA" content="false"/>

<meta name="access_permission:modify_annotations" content="true"/>

<meta name="access_permission:can_print_degraded" content="true"/>

<meta name="dc:creator" content="Michele Stutz"/>

<meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>

<meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>

<meta name="dc:format" content="application/pdf; version=1.7"/>

<meta name="xmpMM:DocumentID" content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>

<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:fill_in_form" content="true"/>

<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>

<meta name="pdf:encrypted" content="false"/>

<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>

<meta name="Content-Length" content="38927"/>

<meta name="pdf:hasMarkedContent" content="true"/>

<meta name="Content-Type" content="application/pdf"/>

<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>

<meta name="pdf:docinfo:creator" content="Michele Stutz"/>

<meta name="dc:language" content="en-US"/>

<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:extract_for_accessibility" content="true"/>

<meta name="access_permission:assemble_document" content="true"/>

<meta name="xmpTPg:NPages" content="1"/>

<meta name="resourceName" content="sample german image.pdf"/>

<meta name="pdf:hasXMP" content="true"/>

<meta name="access_permission:extract_content" content="true"/>

<meta name="access_permission:can_print" content="true"/>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser"/>

<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>

<meta name="access_permission:can_modify" content="true"/>

<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>

<title/>

</head>

<body><div class="page"><p/>

<p> </p>

<p/>

<div class="ocr">Armin Laschet will an die Spitze und kampft



Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht unter Druck.

Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun

kampferisch und warnt vor einem Linksruck.

</div>



</div>


Re: Problem running OCR

Posted by Tim Allison <ta...@apache.org>.

I'm having luck with 2.1.0's app.  How are you calling Tika?  What
configurations do you have?  Is tesseract on your command line, etc?

java -jar tika-app-2.1.0.jar ~/Downloads/sample\ german\ image.pdf

INFO  [main] 10:07:23,958 org.apache.tika.parser.ocr.TesseractOCRParser
Tesseract is installed and is being invoked. This can add greatly to
processing time.  If you do not want tesseract to be applied to your files
see:
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR#TikaOCR-disable-ocr

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta name="pdf:PDFVersion" content="1.7"/>

<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365"/>

<meta name="pdf:hasXFA" content="false"/>

<meta name="access_permission:modify_annotations" content="true"/>

<meta name="access_permission:can_print_degraded" content="true"/>

<meta name="dc:creator" content="Michele Stutz"/>

<meta name="dcterms:created" content="2021-09-22T20:14:08Z"/>

<meta name="dcterms:modified" content="2021-09-22T20:14:08Z"/>

<meta name="dc:format" content="application/pdf; version=1.7"/>

<meta name="xmpMM:DocumentID"
content="uuid:20CA6E61-9351-4A15-AB8D-4AAD17399C3D"/>

<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for
Microsoft 365"/>

<meta name="access_permission:fill_in_form" content="true"/>

<meta name="pdf:docinfo:modified" content="2021-09-22T20:14:08Z"/>

<meta name="pdf:encrypted" content="false"/>

<meta name="xmp:CreateDate" content="2021-09-22T15:14:08Z"/>

<meta name="Content-Length" content="38927"/>

<meta name="pdf:hasMarkedContent" content="true"/>

<meta name="Content-Type" content="application/pdf"/>

<meta name="xmp:ModifyDate" content="2021-09-22T15:14:08Z"/>

<meta name="pdf:docinfo:creator" content="Michele Stutz"/>

<meta name="dc:language" content="en-US"/>

<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365"/>

<meta name="access_permission:extract_for_accessibility" content="true"/>

<meta name="access_permission:assemble_document" content="true"/>

<meta name="xmpTPg:NPages" content="1"/>

<meta name="resourceName" content="sample german image.pdf"/>

<meta name="pdf:hasXMP" content="true"/>

<meta name="access_permission:extract_content" content="true"/>

<meta name="access_permission:can_print" content="true"/>

<meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.DefaultParser"/>

<meta name="X-TIKA:Parsed-By"
content="org.apache.tika.parser.pdf.PDFParser"/>

<meta name="access_permission:can_modify" content="true"/>

<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft
365"/>

<meta name="pdf:docinfo:created" content="2021-09-22T20:14:08Z"/>

<title/>

</head>

<body><div class="page"><p/>

<p> </p>

<p/>

<div class="ocr">Armin Laschet will an die Spitze und kampft


Armin Laschet will auf Kanzlerin Merkel folgen. Doch der CDU-Chef steht
unter Druck.

Umfragen sehen ihn abgeschlagen. Im Wahlkampf-Endspurt gibt sich Laschet nun

kampferisch und warnt vor einem Linksruck.

</div>


</div>
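
For the "Is tesseract on your command line" question, one way to check from the same JVM that runs Tika is to scan the PATH that the Java process actually sees, since an IDE can launch with a different environment than a shell. The following is only a minimal sketch using the Java standard library; the class name TesseractPathCheck and the method findOnPath are hypothetical names made up for this illustration and are not part of Tika.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Tika): looks for an executable on a PATH-style string.
public class TesseractPathCheck {

    /** Returns the first executable file with the given name found in pathValue, or null. */
    public static File findOnPath(String executable, String pathValue) {
        if (pathValue == null || pathValue.isEmpty()) {
            return null;
        }
        // On Windows the binary is normally tesseract.exe, so try both names there.
        boolean windows = System.getProperty("os.name", "").toLowerCase().contains("win");
        List<String> names = new ArrayList<>();
        names.add(executable);
        if (windows) {
            names.add(executable + ".exe");
        }
        // File.pathSeparator is ";" on Windows and ":" elsewhere.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            for (String name : names) {
                File candidate = new File(dir, name);
                if (candidate.isFile() && candidate.canExecute()) {
                    return candidate;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Uses the environment of this JVM, which may differ from your shell's.
        File found = findOnPath("tesseract", System.getenv("PATH"));
        System.out.println(found == null
                ? "tesseract not found on PATH"
                : "tesseract found at " + found);
    }
}
```

If the binary is found but OCR still comes back empty, running "tesseract --list-langs" in a shell shows which language packs are installed, which helps when a language such as deu is requested.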

On Wed, Sep 22, 2021 at 9:33 PM Peter Kronenberg <pe...@torch.ai>
wrote:

> Ok this is one of those situations where I must be doing something stupid, but I can’t get Tika to properly process the attached file.  It’s an image based PDF.  It’s just not getting any text out of it.  Even if I run with OCRStrategy = ONLY_OCR.
>
> It’s definitely getting to the call to doOCROnCurrentPage(*AUTO*)in AbstractPDF2XHTML, so it’s not a matter of the character counts preventing the OCR.
>
>
>
> Don’t think it has anything to do with the fact that it is in German.
> Tried setting the language to DEU, but same results
>
>
>
> What is going on?