Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2020/12/31 14:58:33 UTC

OCR on PDFs

I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.  Is there a way to avoid this?  Even if it has to make two passes, one for the straight text and then another for just the images.

Re: {EXTERNAL}OCR on PDFs

Posted by Tim Allison <ta...@apache.org>.
> I think the original question I was asking is the fact that I get
> duplication with OCR_AND_TEXT_EXTRACTION.  So this is expected?
Yes.  Given that we didn't have time/resources to do OCR
correctly/carefully, this is expected.
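
Since the duplication is expected, one possible workaround is to filter the OCR text after parsing. The following is only a sketch using plain JDK classes, not anything Tika provides: the class and method names are hypothetical, and it assumes you already hold the extracted text and the OCR text of a page as two separate strings. It keeps only those OCR lines that do not already occur in the extracted text, which works when OCR and text extraction break lines the same way and recognition is clean.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Tika.
final class OcrDedupSketch {

    /** Collapses whitespace and case so minor layout differences do not matter. */
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    /** Returns the OCR lines that do not already occur in the extracted text. */
    static List<String> ocrOnlyLines(String extractedText, String ocrText) {
        Set<String> seen = new HashSet<>();
        for (String line : extractedText.split("\\R")) {
            String key = normalize(line);
            if (!key.isEmpty()) {
                seen.add(key);
            }
        }
        List<String> result = new ArrayList<>();
        for (String line : ocrText.split("\\R")) {
            String key = normalize(line);
            if (!key.isEmpty() && !seen.contains(key)) {
                result.add(line.trim());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Start of image\nEnd of image\n";
        String ocr = "Start of image\n\nPellentesque adipiscing commodo elit\n\nEnd of image\n";
        System.out.println(ocrOnlyLines(text, ocr));
    }
}
```

An exact line match is deliberately strict: a single OCR misread makes a line count as new, so duplicates may survive on noisy scans. A fuzzy comparison would catch more, at the cost of sometimes dropping real image text.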

> I guess my only concern with OCR’ing the entire page is that OCR is never
> going to be as accurate as extracting text, right?  Have you seen this as
> an issue? Or is Tesseract pretty accurate (I guess if you’re turning clean
> text into an image, it would be pretty accurate, as opposed to something
> that was actually scanned and might not be as clean)

If there's a problem with the PDF or the fonts or the unicode mapping or
it's a Tuesday (joking....maybe?), OCR can be more accurate than the
electronic text.  tesseract is surprisingly good on general English if the
image is in decent shape, etc...  However, you're generally right, that the
electronic text _should_ be more accurate.  The other huge drawback to OCR
is that it requires quite a bit in resources to render the page and then
run OCR.  If you want to run experiments on your own documents, I can help
you run the tika-eval module to assess extraction quality.

> If EnableImageProcessing is true, then OCR_Strategy is ignored, is that
> right?


> And again, to clarify, OCR Strategy of Auto means NO_OCR but if there is
> not much text, then switch to OCR_ONLY, correct?
Almost.  The problem is that we've already written to the handler whatever
we got for that page by the time we make the determination.  So, AUTO means
NO_OCR, but if there is not much text, OCR_AND_TEXT_EXTRACTION.
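
To make the ordering concrete, here is a minimal sketch of that behaviour in plain Java. It is not Tika's code: the class name, the method, and the 10-character threshold are assumptions for illustration only, since nothing above says how "not much text" is measured.

```java
// Sketch only: models the AUTO behaviour described above; not Tika's code.
final class AutoOcrSketch {

    // Hypothetical cut-off for "not much text".
    static final int MIN_CHARS_PER_PAGE = 10;

    /** Returns what ends up in the output for a single page under AUTO. */
    static String renderPage(String extractedText, String ocrText) {
        StringBuilder out = new StringBuilder();
        // The extracted text is written to the handler first...
        out.append(extractedText);
        // ...and only then is the decision made, so the OCR result can
        // be appended to the extracted text but can never replace it.
        if (extractedText.trim().length() < MIN_CHARS_PER_PAGE) {
            out.append(ocrText);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(renderPage("A full page of electronic text.", " [OCR] unused"));
        System.out.println(renderPage("p.1", " [OCR] scanned page text"));
    }
}
```

In this model a page with only a page number still carries that page number in the output, followed by the OCR text, which is why the effective fallback is OCR_AND_TEXT_EXTRACTION rather than OCR_ONLY.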

> Tika 2.0.0 looks good.  Is that available yet for testing?
If you want to build locally, yes.  We're planning an ALPHA release in the
next few weeks.


> What exactly do you mean by this:
>
> >>There's no need any more to specify the embedded parser in the
> ParseContext.  We automatically use the AutoDetectParser as configured to
> parse embedded documents.
At some point, the default was to parse embedded objects.  Then a change was
made (in 1.7?) that required users to pass in the parser to use for
embedded documents... this was before my time on the project.  This had the
effect that embedded files were not parsed for lots of folks, including us
in tika-server (see: https://issues.apache.org/jira/browse/TIKA-1584). We
then made a change to the default behavior to add AutoDetectParser to parse
embedded documents unless a user specified their own
(https://issues.apache.org/jira/browse/TIKA-2096).


> I believe my code was similar to yours.  I want to make sure I’m doing it
> correctly. I create a PDFParserConfig and TesseractOCRConfig, set my
> options and then add the config to the parseContext.  Is that right?
Yes.

On Mon, Jan 4, 2021 at 11:59 AM Peter Kronenberg <pe...@torch.ai>
wrote:

> So I think the original question I was asking is the fact that I get duplication with *OCR_AND_TEXT_EXTRACTION.*  So this is expected?
>
> I guess my only concern with OCR’ing the entire page is that OCR is never going to be as accurate as extracting text, right?  Have you seen this as an issue? Or is Tesseract pretty accurate (I guess if you’re turning clean text into an image, it would be pretty accurate, as opposed to something that was actually scanned and might not be as clean)
>
>
>
> If EnableImageProcessing is true, then OCR_Strategy is ignored, is that
> right?
>
>
>
> And again, to clarify, OCR Strategy of Auto means NO_OCR but if there is
> not much text, then switch to OCR_ONLY, correct?
>
>
>
> *From:* Peter Kronenberg
> *Sent:* Monday, January 4, 2021 11:29 AM
> *To:* user@tika.apache.org; tallison@apache.org
> *Subject:* RE: {EXTERNAL}OCR on PDFs
>
>
>
> Tika 2.0.0 looks good.  Is that available yet for testing?
>
>
>
> What exactly do you mean by this:
>
> *>>There's no need any more to specify the embedded parser in the
> ParseContext.  We automatically use the AutoDetectParser as configured to
> parse embedded documents.*
>
>
>
> I believe my code was similar to yours.  I want to make sure I’m doing it
> correctly. I create a PDFParserConfig and TesseractOCRConfig, set my
> options and then add the config to the parseContext.  Is that right?
>
>
>
>
>
>
>
> *From:* Tim Allison <ta...@apache.org>
> *Sent:* Monday, January 4, 2021 10:58 AM
> *To:* user@tika.apache.org
> *Subject:* Re: {EXTERNAL}OCR on PDFs
>
>
>
> To confirm, you aren't getting any OCR on "Simple-text-image.docx" or on
> "Simple-text-image.pdf"?
>
>
>
> As we mention on the wiki[1], there are two OCR strategies for PDFs, and
> you have to pick one of them to get OCR to work with PDFs currently, but
> see [2] for Tika 2.0.0:
>
>
>
> a) extractInlineImages -- this extracts all inline images and runs OCR
> against each image...this can be a bad idea
>
> b) OcrStrategy -- this renders each page and then will run OCR against the
> rendered image.
>
>
>
> That said, neither of those should be affecting a docx image.
>
>
>
> There's no need any more to specify the embedded parser in the
> ParseContext.  We automatically use the AutoDetectParser as configured to
> parse embedded documents.
>
>
>
> If you use the ToXMLContentHandler, you'll get output about which parsers
> were used, and you'll get <div> markup for page breaks and for ocr.
>
>
>
> With this code:
>
> PDFParserConfig config = new PDFParserConfig();
> config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
> ParseContext parseContext = new ParseContext();
> parseContext.set(PDFParserConfig.class, config);
>
> Parser p = new AutoDetectParser();
> ContentHandler handler = new ToXMLContentHandler();
> Metadata metadata = new Metadata();
> Path path = Paths.get("/..fill.in.../Simple-text-image.pdf");
> try (InputStream tis = TikaInputStream.get(path, metadata)) {
>     p.parse(tis, handler, metadata, parseContext);
> }
> System.out.println(handler.toString());
>
> I get this:
>
>
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="date" content="2020-12-31T20:08:09Z" />
> <meta name="pdf:PDFVersion" content="1.7" />
> <meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
> <meta name="pdf:hasXFA" content="false" />
> <meta name="access_permission:modify_annotations" content="true" />
> <meta name="access_permission:can_print_degraded" content="true" />
> <meta name="dc:creator" content="Peter Kronenberg" />
> <meta name="language" content="en-US" />
> <meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
> <meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
> <meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
> <meta name="dc:format" content="application/pdf; version=1.7" />
> <meta name="xmpMM:DocumentID"
> content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
> <meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
> <meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for
> Microsoft 365" />
> <meta name="access_permission:fill_in_form" content="true" />
> <meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
> <meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
> <meta name="pdf:encrypted" content="false" />
> <meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
> <meta name="modified" content="2020-12-31T20:08:09Z" />
> <meta name="Content-Length" content="47113" />
> <meta name="pdf:hasMarkedContent" content="true" />
> <meta name="Content-Type" content="application/pdf" />
> <meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
> <meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
> <meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
> <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
> <meta name="X-Parsed-By" content="class
> org.apache.tika.parser.ocr.TesseractOCRParser" />
> <meta name="creator" content="Peter Kronenberg" />
> <meta name="dc:language" content="en-US" />
> <meta name="meta:author" content="Peter Kronenberg" />
> <meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
> <meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
> <meta name="created" content="2020-12-31T20:08:09Z" />
> <meta name="access_permission:extract_for_accessibility" content="true" />
> <meta name="access_permission:assemble_document" content="true" />
> <meta name="xmpTPg:NPages" content="2" />
> <meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
> <meta name="resourceName" content="Simple-text-image.pdf" />
> <meta name="pdf:hasXMP" content="true" />
> <meta name="access_permission:extract_content" content="true" />
> <meta name="access_permission:can_print" content="true" />
> <meta name="Author" content="Peter Kronenberg" />
> <meta name="producer" content="Microsoft® Word for Microsoft 365" />
> <meta name="access_permission:can_modify" content="true" />
> <meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft
> 365" />
> <meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
> <title></title>
> </head>
> <body><div class="page"><p />
> <p>Start of text
> </p>
> <p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor incididunt ut
> </p>
> <p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis
> nunc sed augue lacus. Et netus
> </p>
> <p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida.
> Scelerisque fermentum dui
> </p>
> <p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus.
> Pharetra massa massa ultricies
> </p>
> <p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit
> adipiscing. Auctor
> </p>
> <p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed
> vulputate mi sit amet mauris
> </p>
> <p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum
> quisque non tellus
> </p>
> <p>orci ac auctor augue.
> </p>
> <p>Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien
> nec sagittis. Vestibulum
> </p>
> <p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt
> lobortis. Diam ut venenatis
> </p>
> <p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada
> pellentesque elit eget gravida cum
> </p>
> <p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt
> lobortis. Dictum varius duis at
> </p>
> <p>consectetur lorem donec massa sapien faucibus. Integer malesuada nunc
> vel risus. Sit amet
> </p>
> <p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non
> blandit massa enim nec dui
> </p>
> <p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at
> tellus at urna
> </p>
> <p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus
> pellentesque eu tincidunt tortor
> </p>
> <p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus
> urna.
> </p>
> <p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae
> ultricies leo. Gravida neque
> </p>
> <p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at
> quis risus sed.
> </p>
> <p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id
> semper risus in hendrerit
> </p>
> <p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam
> in. Pharetra sit amet
> </p>
> <p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit
> gravida rutrum quisque non
> </p>
> <p>tellus orci.
> </p>
> <p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio
> euismod. Mollis
> </p>
> <p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat
> nibh. Tristique senectus et
> </p>
> <p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet
> consectetur adipiscing elit ut
> </p>
> <p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida
> in fermentum et sollicitudin
> </p>
> <p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat
> maecenas volutpat blandit
> </p>
> <p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu
> dui. Interdum posuere lorem
> </p>
> <p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor
> purus non. Ac turpis egestas
> </p>
> <p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
> </p>
> <p>End of text
> </p>
> <p>
> </p>
> <p>  </p>
> <p />
> <div class="ocr">Start of text
>
> Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
> tempor incididunt ut
> labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc
> sed augue lacus. Et netus
> et malesuada fames. Nunc sed id semper risus in hendrerit gravida.
> Scelerisque fermentum dui
> faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra
> massa massa ultricies
> mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit
> adipiscing. Auctor
> augue mauris augue neque gravida in fermentum et sollicitudin. Sed
> vulputate mi sit amet mauris
> commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum
> quisque non tellus
> orci ac auctor augue.
>
> Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec
> sagittis. Vestibulum
> rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt
> lobortis. Diam ut venenatis
> tellus in metus vulputate eu scelerisque felis. Nulla malesuada
> pellentesque elit eget gravida cum
> sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt
> lobortis. Dictum varius duis at
> consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel
> risus. Sit amet
> consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non
> blandit massa enim nec dui
> nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at
> tellus at urna
> condimentum mattis pellentesque id. Egestas tellus rutrum tellus
> pellentesque eu tincidunt tortor
> aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.
>
> Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae
> ultricies leo. Gravida neque
> convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at
> quis risus sed.
> Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper
> risus in hendrerit
> gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in.
> Pharetra sit amet
> aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida
> rutrum quisque non
> tellus orci.
>
> Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio
> euismod. Mollis
> aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh.
> Tristique senectus et
> netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet
> consectetur adipiscing elit ut
> aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in
> fermentum et sollicitudin
> ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat
> maecenas volutpat blandit
> aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui.
> Interdum posuere lorem
> ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus
> non. Ac turpis egestas
> sed tempus urna. Nam aliquam sem et tortor consequat id porta.
>
> End of text
> </div>
> </div>
> <div class="page"><p />
> <p>Start of image
> </p>
> <p>End of image
> </p>
> <p> </p>
> <p />
> <div class="ocr">Start of image
>
> Pellentesque adipiscing commodo elit at imperdiet dui. Consectetur purus
> ut faucibus pulvinar. Tincidunt
> praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada
> bibendum arcu vitae elementum
> curabitur vitae. Velit euismod in pellentesque massa placerat duis.
> Fermentum et sollicitudin ac orci
> phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie.
> Commodo quis imperdiet massa
> tincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut
> porttitor leo a diam sollicitudin
>
> tempor. Amet aliquam id diam maecenas ultricies mi eget mauris pharetra.
> Ullamcorper dignissim cras
> tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis
> elementum nibh tellus. Id aliquet
>
> lectus proin nibh nisl condimentum. Vitae elementum curabitur vitae nunc
> sed velit. Rnoncus dolor purus
> non enim praesent elementum facilisis leo vel. Velit egestas dui id ornare
> arcu odio ut sem nulla. Purus
> sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.
>
> End of image
> </div>
> </div>
> </body></html>
>
> Process finished with exit code 0
>
>
>
> [1]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
>
> [2] TIKA-3258
>
>
>
> On Mon, Jan 4, 2021 at 10:15 AM Peter Kronenberg <
> peter.kronenberg@torch.ai> wrote:
>
> I’m still having a problem understanding the options on the PDFParser:
> OCRStrategy and ExtractInlineImages   These options appear to be getting
> ignored.  I’m not seeing any difference.
>
>
>
> Here’s the code I’m using.  I’ve also attached the file I’m testing with.
> Some of my confusion is similar to what was expressed here:
> http://apache-tika-users.1629097.n2.nabble.com/OCR-Strategy-ocr-only-extracts-also-text-td7574798.html,
> but there was never any resolution
>
>
>
> When creating a set of parsers by adding to the ParseContext, how do I
> figure out what parser was ultimately used?
>
>
>
> public class TikaOCRParser {
>
>     private static final PDFParserConfig pdfConfig = new PDFParserConfig();
>     private static final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
>     private static final AutoDetectParser parser = new AutoDetectParser();
>     private static final ParseContext parseContext = new ParseContext();
>
>     static {
>         parseContext.set(AutoDetectParser.class, parser);
>         parseContext.set(PDFParserConfig.class, pdfConfig);
>         // parseContext.set(TesseractOCRConfig.class, tessConfig);
>     }
>
>     public static String parse(String file) throws TikaException, SAXException, IOException {
>         log.info(String.format("Tesseract path: %s, exists: %s",
>                 tessConfig.getTesseractPath(), new File(tessConfig.getTesseractPath()).exists()));
>         log.info(String.format("Tessdata path: %s, exists: %s",
>                 tessConfig.getTessdataPath(), new File(tessConfig.getTessdataPath()).exists()));
>         log.info(String.format("Image Magick path: %s, exists: %s",
>                 tessConfig.getImageMagickPath(), new File(tessConfig.getImageMagickPath()).exists()));
>         log.info("enableImageProcessing: " + tessConfig.isEnableImageProcessing());
>
>         pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
>         pdfConfig.setExtractInlineImages(false);
>         log.info("PDF Extract inline images: " + pdfConfig.getExtractInlineImages());
>         log.info("PDF OCR Strategy: " + pdfConfig.getOcrStrategy());
>         log.info("PDF OCR DPI: " + pdfConfig.getOcrDPI());
>         log.info("PDF Detect angles: " + pdfConfig.getDetectAngles());
>
>         ContentHandler handler = new BodyContentHandler(-1);
>         Metadata metadata = new Metadata();
>         try (TikaInputStream stream = TikaInputStream.get(
>                 new BufferedInputStream(new FileInputStream(file)))) {
>             log.info("calling parse on " + file);
>             parser.parse(stream, handler, metadata, parseContext);
>         }
>         //Arrays.stream(metadata.names()).filter(n -> !metadata.get(n).isEmpty()).forEach(n -> log.info(String.format("%s: %s", n, metadata.get(n))));
>         return handler.toString();
>     }
>
>     public static void main(String[] args) throws TikaException, SAXException, IOException {
>         String file = "c:\\testFiles\\Simple-text-image.docx";
>         System.out.println("Text: " + parse(file));
>     }
> }
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Thursday, December 31, 2020 9:59 AM
> *To:* user@tika.apache.org
> *Subject:* {EXTERNAL}OCR on PDFs
>
>
>
> I’ve got Tika working with Tesseract on PDF files, but it seems that if I
> give it a PDF file that has both searchable text and images, the text is
> OCRed twice.  Is there a way to avoid this?  Even if it has to make two
> passes, one for the straight text and then another for just the images
>
>

RE: {EXTERNAL}OCR on PDFs

Posted by Peter Kronenberg <pe...@torch.ai>.
So I think the original question I was asking is the fact that I get duplication with OCR_AND_TEXT_EXTRACTION.  So this is expected?

I guess my only concern with OCR’ing the entire page is that OCR is never going to be as accurate as extracting text, right?  Have you seen this as an issue? Or is Tesseract pretty accurate (I guess if you’re turning clean text into an image, it would be pretty accurate, as opposed to something that was actually scanned and might not be as clean)

If EnableImageProcessing is true, then OCR_Strategy is ignored, is that right?

And again, to clarify, OCR Strategy of Auto means NO_OCR but if there is not much text, then switch to OCR_ONLY, correct?

From: Peter Kronenberg
Sent: Monday, January 4, 2021 11:29 AM
To: user@tika.apache.org; tallison@apache.org
Subject: RE: {EXTERNAL}OCR on PDFs

Tika 2.0.0 looks good.  Is that available yet for testing?

What exactly do you mean by this:
>>There's no need any more to specify the embedded parser in the ParseContext.  We automatically use the AutoDetectParser as configured to parse embedded documents.

I believe my code was similar to yours.  I want to make sure I’m doing it correctly. I create a PDFParserConfig and TesseractOCRConfig, set my options and then add the config to the parseContext.  Is that right?



From: Tim Allison <ta...@apache.org>
Sent: Monday, January 4, 2021 10:58 AM
To: user@tika.apache.org
Subject: Re: {EXTERNAL}OCR on PDFs

To confirm, you aren't getting any OCR on "Simple-text-image.docx" or on "Simple-text-image.pdf"?

As we mention on the wiki[1], there are two OCR strategies for PDFs, and you have to pick one of them to get OCR to work with PDFs currently, but see [2] for Tika 2.0.0:

a) extractInlineImages -- this extracts all inline images and runs OCR against each image...this can be a bad idea
b) OcrStrategy -- this renders each page and then will run OCR against the rendered image.

That said, neither of those should be affecting a docx image.

There's no need any more to specify the embedded parser in the ParseContext.  We automatically use the AutoDetectParser as configured to parse embedded documents.

If you use the ToXMLContentHandler, you'll get output about which parsers were used, and you'll get <div> markup for page breaks and for ocr.

With this code:

PDFParserConfig config = new PDFParserConfig();
config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, config);

Parser p = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
Path path = Paths.get("/..fill.in.../Simple-text-image.pdf");
try (InputStream tis = TikaInputStream.get(path, metadata)) {
    p.parse(tis, handler, metadata, parseContext);
}
System.out.println(handler.toString());
I get this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="X-Parsed-By" content="class org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text
</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
</p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed augue lacus. Et netus
</p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida. Scelerisque fermentum dui
</p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra massa massa ultricies
</p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit adipiscing. Auctor
</p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed vulputate mi sit amet mauris
</p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum quisque non tellus
</p>
<p>orci ac auctor augue.
</p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec sagittis. Vestibulum
</p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Diam ut venenatis
</p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada pellentesque elit eget gravida cum
</p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Dictum varius duis at
</p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel risus. Sit amet
</p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non blandit massa enim nec dui
</p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at tellus at urna
</p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus pellentesque eu tincidunt tortor
</p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.
</p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae ultricies leo. Gravida neque
</p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at quis risus sed.
</p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper risus in hendrerit
</p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in. Pharetra sit amet
</p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida rutrum quisque non
</p>
<p>tellus orci.
</p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio euismod. Mollis
</p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh. Tristique senectus et
</p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet consectetur adipiscing elit ut
</p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in fermentum et sollicitudin
</p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat maecenas volutpat blandit
</p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui. Interdum posuere lorem
</p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus non. Ac turpis egestas
</p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
</p>
<p>End of text
</p>
<p>
</p>
<p>  </p>
<p />
<div class="ocr">Start of text

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed augue lacus. Et netus
et malesuada fames. Nunc sed id semper risus in hendrerit gravida. Scelerisque fermentum dui
faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra massa massa ultricies
mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit adipiscing. Auctor
augue mauris augue neque gravida in fermentum et sollicitudin. Sed vulputate mi sit amet mauris
commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum quisque non tellus
orci ac auctor augue.

Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec sagittis. Vestibulum
rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Diam ut venenatis
tellus in metus vulputate eu scelerisque felis. Nulla malesuada pellentesque elit eget gravida cum
sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Dictum varius duis at
consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel risus. Sit amet
consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non blandit massa enim nec dui
nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at tellus at urna
condimentum mattis pellentesque id. Egestas tellus rutrum tellus pellentesque eu tincidunt tortor
aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.

Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae ultricies leo. Gravida neque
convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at quis risus sed.
Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper risus in hendrerit
gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in. Pharetra sit amet
aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida rutrum quisque non
tellus orci.

Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio euismod. Mollis
aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh. Tristique senectus et
netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet consectetur adipiscing elit ut
aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in fermentum et sollicitudin
ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat maecenas volutpat blandit
aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui. Interdum posuere lorem
ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus non. Ac turpis egestas
sed tempus urna. Nam aliquam sem et tortor consequat id porta.

End of text
</div>
</div>
<div class="page"><p />
<p>Start of image
</p>
<p>End of image
</p>
<p> </p>
<p />
<div class="ocr">Start of image

Pellentesque adipiscing commodo elit at imperdiet dui. Consectetur purus ut faucibus pulvinar. Tincidunt
praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada bibendum arcu vitae elementum
curabitur vitae. Velit euismod in pellentesque massa placerat duis. Fermentum et sollicitudin ac orci
phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie. Commodo quis imperdiet massa
tincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut porttitor leo a diam sollicitudin

tempor. Amet aliquam id diam maecenas ultricies mi eget mauris pharetra. Ullamcorper dignissim cras
tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis elementum nibh tellus. Id aliquet

lectus proin nibh nisl condimentum. Vitae elementum curabitur vitae nunc sed velit. Rnoncus dolor purus
non enim praesent elementum facilisis leo vel. Velit egestas dui id ornare arcu odio ut sem nulla. Purus
sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.

End of image
</div>
</div>
</body></html>

Process finished with exit code 0

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
[2] TIKA-3258

On Mon, Jan 4, 2021 at 10:15 AM Peter Kronenberg <pe...@torch.ai> wrote:
I’m still having a problem understanding the options on the PDFParser: OCRStrategy and ExtractInlineImages.  These options appear to be getting ignored.  I’m not seeing any difference.

Here’s the code I’m using.  I’ve also attached the file I’m testing with.  Some of my confusion is similar to what was expressed here: http://apache-tika-users.1629097.n2.nabble.com/OCR-Strategy-ocr-only-extracts-also-text-td7574798.html, but there was never any resolution.

When creating a set of parsers by adding to the ParseContext, how do I figure out what parser was ultimately used?

public class TikaOCRParser {

    private static final PDFParserConfig pdfConfig = new PDFParserConfig();
    private static final TesseractOCRConfig tessConfig = new TesseractOCRConfig();

    private static final AutoDetectParser parser = new AutoDetectParser();
    private static final ParseContext parseContext = new ParseContext();

    static {
        parseContext.set(AutoDetectParser.class, parser);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        //   parseContext.set(TesseractOCRConfig.class, tessConfig);
    }

    public static String parse(String file) throws TikaException, SAXException, IOException {
        log.info(String.format("Tesseract path: %s, exists: %s", tessConfig.getTesseractPath(), new File(tessConfig.getTesseractPath()).exists()));
        log.info(String.format("Tessdata path:  %s, exists: %s", tessConfig.getTessdataPath(), new File(tessConfig.getTessdataPath()).exists()));
        log.info(String.format("Image Magick path: %s, exists: %s", tessConfig.getImageMagickPath(), new File(tessConfig.getImageMagickPath()).exists()));
        log.info("enableImageProcessing: " + tessConfig.isEnableImageProcessing());
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
        pdfConfig.setExtractInlineImages(false);
        log.info("PDF Extract inline images: " + pdfConfig.getExtractInlineImages());
        log.info("PDF OCR Strategy: " + pdfConfig.getOcrStrategy());
        log.info("PDF OCR DPI: " + pdfConfig.getOcrDPI());
        log.info("PDF Detect angles: " + pdfConfig.getDetectAngles());

        ContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(new FileInputStream(file)))) {
            log.info("calling parse on " + file);
            parser.parse(stream, handler, metadata, parseContext);
        }
        //Arrays.stream(metadata.names()).filter(n -> !metadata.get(n).isEmpty()).forEach(n -> log.info(String.format("%s: %s", n, metadata.get(n))));
        return handler.toString();
    }

    public static void main(String[] args) throws TikaException, SAXException, IOException {
        String file = "c:\\testFiles\\Simple-text-image.docx";

        System.out.println("Text: " + parse(file));
    }
}

From: Peter Kronenberg <pe...@torch.ai>
Sent: Thursday, December 31, 2020 9:59 AM
To: user@tika.apache.org
Subject: {EXTERNAL}OCR on PDFs

I’ve got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.  Is there a way to avoid this?  Even if it has to make two passes, one for the straight text and then another for just the images

RE: {EXTERNAL}OCR on PDFs

Posted by Peter Kronenberg <pe...@torch.ai>.
Tika 2.0.0 looks good.  Is that available yet for testing?

What exactly do you mean by this:
>>There's no need any more to specify the embedded parser in the ParseContext.  We automatically use the AutoDetectParser as configured to parse embedded documents.

I believe my code was similar to yours.  I want to make sure I’m doing it correctly. I create a PDFParserConfig and TesseractOCRConfig, set my options and then add the config to the parseContext.  Is that right?



From: Tim Allison <ta...@apache.org>
Sent: Monday, January 4, 2021 10:58 AM
To: user@tika.apache.org
Subject: Re: {EXTERNAL}OCR on PDFs

To confirm, you aren't getting any OCR on "Simple-text-image.docx" or on "Simple-text-image.pdf"?

As we mention on the wiki[1], there are two OCR strategies for PDFs, and you have to pick one of them to get OCR to work with PDFs currently, but see [2] for Tika 2.0.0:

a) extractInlineImages -- this extracts all inline images and runs OCR against each image...this can be a bad idea
b) OcrStrategy -- this renders each page and then will run OCR against the rendered image.

That said, neither of those should be affecting a docx image.

There's no need any more to specify the embedded parser in the ParseContext.  We automatically use the AutoDetectParser as configured to parse embedded documents.

If you use the ToXMLContentHandler, you'll get output about which parsers were used, and you'll get <div> markup for page breaks and for OCR.

With this code:

PDFParserConfig config = new PDFParserConfig();
config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, config);

Parser p = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
Path path = Paths.get("/..fill.in.../Simple-text-image.pdf");
try (InputStream tis = TikaInputStream.get(path, metadata)) {
    p.parse(tis, handler, metadata, parseContext);
}
System.out.println(handler.toString());

I get this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="X-Parsed-By" content="class org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text
</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
</p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed augue lacus. Et netus
</p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida. Scelerisque fermentum dui
</p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra massa massa ultricies
</p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit adipiscing. Auctor
</p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed vulputate mi sit amet mauris
</p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum quisque non tellus
</p>
<p>orci ac auctor augue.
</p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec sagittis. Vestibulum
</p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Diam ut venenatis
</p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada pellentesque elit eget gravida cum
</p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Dictum varius duis at
</p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel risus. Sit amet
</p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non blandit massa enim nec dui
</p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at tellus at urna
</p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus pellentesque eu tincidunt tortor
</p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.
</p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae ultricies leo. Gravida neque
</p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at quis risus sed.
</p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper risus in hendrerit
</p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in. Pharetra sit amet
</p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida rutrum quisque non
</p>
<p>tellus orci.
</p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio euismod. Mollis
</p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh. Tristique senectus et
</p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet consectetur adipiscing elit ut
</p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in fermentum et sollicitudin
</p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat maecenas volutpat blandit
</p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui. Interdum posuere lorem
</p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus non. Ac turpis egestas
</p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
</p>
<p>End of text
</p>
<p>
</p>
<p>  </p>
<p />
<div class="ocr">Start of text

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed augue lacus. Et netus
et malesuada fames. Nunc sed id semper risus in hendrerit gravida. Scelerisque fermentum dui
faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra massa massa ultricies
mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit adipiscing. Auctor
augue mauris augue neque gravida in fermentum et sollicitudin. Sed vulputate mi sit amet mauris
commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum quisque non tellus
orci ac auctor augue.

Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec sagittis. Vestibulum
rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Diam ut venenatis
tellus in metus vulputate eu scelerisque felis. Nulla malesuada pellentesque elit eget gravida cum
sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Dictum varius duis at
consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel risus. Sit amet
consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non blandit massa enim nec dui
nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at tellus at urna
condimentum mattis pellentesque id. Egestas tellus rutrum tellus pellentesque eu tincidunt tortor
aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.

Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae ultricies leo. Gravida neque
convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at quis risus sed.
Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper risus in hendrerit
gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in. Pharetra sit amet
aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida rutrum quisque non
tellus orci.

Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio euismod. Mollis
aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh. Tristique senectus et
netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet consectetur adipiscing elit ut
aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in fermentum et sollicitudin
ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat maecenas volutpat blandit
aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui. Interdum posuere lorem
ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus non. Ac turpis egestas
sed tempus urna. Nam aliquam sem et tortor consequat id porta.

End of text
</div>
</div>
<div class="page"><p />
<p>Start of image
</p>
<p>End of image
</p>
<p> </p>
<p />
<div class="ocr">Start of image

Pellentesque adipiscing commodo elit at imperdiet dui. Consectetur purus ut faucibus pulvinar. Tincidunt
praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada bibendum arcu vitae elementum
curabitur vitae. Velit euismod in pellentesque massa placerat duis. Fermentum et sollicitudin ac orci
phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie. Commodo quis imperdiet massa
tincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut porttitor leo a diam sollicitudin

tempor. Amet aliquam id diam maecenas ultricies mi eget mauris pharetra. Ullamcorper dignissim cras
tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis elementum nibh tellus. Id aliquet

lectus proin nibh nisl condimentum. Vitae elementum curabitur vitae nunc sed velit. Rnoncus dolor purus
non enim praesent elementum facilisis leo vel. Velit egestas dui id ornare arcu odio ut sem nulla. Purus
sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.

End of image
</div>
</div>
</body></html>

Process finished with exit code 0

[1] https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
[2] TIKA-3258


Re: {EXTERNAL}OCR on PDFs

Posted by Tim Allison <ta...@apache.org>.
To confirm, you aren't getting any OCR on "Simple-text-image.docx" or on
"Simple-text-image.pdf"?

As we mention on the wiki[1], there are two OCR strategies for PDFs, and
you have to pick one of them to get OCR to work with PDFs currently, but
see [2] for Tika 2.0.0:

a) extractInlineImages -- this extracts all inline images and runs OCR
against each image...this can be a bad idea
b) OcrStrategy -- this renders each page and then will run OCR against the
rendered image.

That said, neither of those should be affecting a docx image.
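
As a rough sketch of how (a) and (b) differ in code, assuming Tika 1.x with tika-parsers on the classpath: the class name PdfOcrChoices and its two helper methods are made up for illustration; only the PDFParserConfig setters and ParseContext.set are actual Tika calls. This has not been tuned against real documents, so treat it as a starting point rather than a recommendation.

```java
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;

// Hypothetical helper class; not part of Tika.
final class PdfOcrChoices {

    // Option (a): extract each inline image and hand it to the OCR parser.
    // Page rendering is switched off so the two approaches are not combined.
    static ParseContext inlineImages() {
        PDFParserConfig config = new PDFParserConfig();
        config.setExtractInlineImages(true);
        config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.NO_OCR);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    // Option (b): render each page and run OCR against the rendered image,
    // in addition to the normal text extraction.
    static ParseContext renderedPages() {
        PDFParserConfig config = new PDFParserConfig();
        config.setExtractInlineImages(false);
        config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }
}
```

Either context would be passed as the last argument to parse(...). With (b) as written, a page that already has electronic text is expected to show that text twice, once extracted and once from OCR.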

There's no need any more to specify the embedded parser in the
ParseContext.  We automatically use the AutoDetectParser as configured to
parse embedded documents.

If you use the ToXMLContentHandler, you'll get output about which parsers were
used, and you'll get <div> markup for page breaks and for OCR.
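
For anyone who wants to pull the parser list out of that XHTML programmatically, here is a minimal sketch using only the JDK's DOM parser. It assumes the handler output is well-formed XML held in a string and that the parsers show up as <meta name="X-Parsed-By" ...> elements, as in the output further down; ParsedByLister is a made-up name, not part of Tika.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; it only reads the XHTML string and never calls Tika.
final class ParsedByLister {

    // Returns the content of every <meta name="X-Parsed-By"> element, in document order.
    static List<String> parsedBy(String xhtml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        List<String> parsers = new ArrayList<>();
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            if ("X-Parsed-By".equals(meta.getAttribute("name"))) {
                parsers.add(meta.getAttribute("content"));
            }
        }
        return parsers;
    }
}
```

Calling ParsedByLister.parsedBy(handler.toString()) after a parse should then give one entry per parser that touched the file.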

With this code:

PDFParserConfig config = new PDFParserConfig();
config.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, config);

Parser p = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
Path path = Paths.get("/..fill.in.../Simple-text-image.pdf");
try (InputStream tis = TikaInputStream.get(path, metadata)) {
    p.parse(tis, handler, metadata, parseContext);
}
System.out.println(handler.toString());

I get this:

<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="X-Parsed-By" content="class org.apache.tika.parser.ocr.TesseractOCRParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text
</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut
</p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc
sed augue lacus. Et netus
</p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida.
Scelerisque fermentum dui
</p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus.
Pharetra massa massa ultricies
</p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit
adipiscing. Auctor
</p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed
vulputate mi sit amet mauris
</p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum
quisque non tellus
</p>
<p>orci ac auctor augue.
</p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien
nec sagittis. Vestibulum
</p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Diam ut venenatis
</p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada
pellentesque elit eget gravida cum
</p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Dictum varius duis at
</p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada nunc
vel risus. Sit amet
</p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non
blandit massa enim nec dui
</p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at
tellus at urna
</p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus
pellentesque eu tincidunt tortor
</p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus
urna.
</p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae
ultricies leo. Gravida neque
</p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at
quis risus sed.
</p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id
semper risus in hendrerit
</p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam
in. Pharetra sit amet
</p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida
rutrum quisque non
</p>
<p>tellus orci.
</p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio
euismod. Mollis
</p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh.
Tristique senectus et
</p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet
consectetur adipiscing elit ut
</p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in
fermentum et sollicitudin
</p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat
maecenas volutpat blandit
</p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu
dui. Interdum posuere lorem
</p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor
purus non. Ac turpis egestas
</p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
</p>
<p>End of text
</p>
<p>
</p>
<p>  </p>
<p />
<div class="ocr">Start of text

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod
tempor incididunt ut
labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed
augue lacus. Et netus
et malesuada fames. Nunc sed id semper risus in hendrerit gravida.
Scelerisque fermentum dui
faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra
massa massa ultricies
mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit
adipiscing. Auctor
augue mauris augue neque gravida in fermentum et sollicitudin. Sed
vulputate mi sit amet mauris
commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum
quisque non tellus
orci ac auctor augue.

Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec
sagittis. Vestibulum
rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Diam ut venenatis
tellus in metus vulputate eu scelerisque felis. Nulla malesuada
pellentesque elit eget gravida cum
sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Dictum varius duis at
consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel
risus. Sit amet
consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non
blandit massa enim nec dui
nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at
tellus at urna
condimentum mattis pellentesque id. Egestas tellus rutrum tellus
pellentesque eu tincidunt tortor
aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.

Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae
ultricies leo. Gravida neque
convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at
quis risus sed.
Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper
risus in hendrerit
gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in.
Pharetra sit amet
aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida
rutrum quisque non
tellus orci.

Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio
euismod. Mollis
aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh.
Tristique senectus et
netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet
consectetur adipiscing elit ut
aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in
fermentum et sollicitudin
ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat
maecenas volutpat blandit
aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui.
Interdum posuere lorem
ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus
non. Ac turpis egestas
sed tempus urna. Nam aliquam sem et tortor consequat id porta.

End of text
</div>
</div>
<div class="page"><p />
<p>Start of image
</p>
<p>End of image
</p>
<p> </p>
<p />
<div class="ocr">Start of image

Pellentesque adipiscing commodo elit at imperdiet dui. Consectetur purus ut
faucibus pulvinar. Tincidunt
praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada
bibendum arcu vitae elementum
curabitur vitae. Velit euismod in pellentesque massa placerat duis.
Fermentum et sollicitudin ac orci
phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie.
Commodo quis imperdiet massa
tincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut porttitor
leo a diam sollicitudin

tempor. Amet aliquam id diam maecenas ultricies mi eget mauris pharetra.
Ullamcorper dignissim cras
tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis
elementum nibh tellus. Id aliquet

lectus proin nibh nisl condimentum. Vitae elementum curabitur vitae nunc
sed velit. Rnoncus dolor purus
non enim praesent elementum facilisis leo vel. Velit egestas dui id ornare
arcu odio ut sem nulla. Purus
sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.

End of image
</div>
</div>
</body></html>

Process finished with exit code 0

[1]
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=109454066
[2] TIKA-3258
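
Because the XHTML output above keeps OCR results inside <div class="ocr">, the OCR text can be separated from the electronic text after parsing, and OCR lines that only repeat the electronic text can be dropped. The following is a minimal sketch in plain Java, assuming the output is well-formed XHTML such as ToXMLContentHandler produces. OcrSplitter, split, and ocrOnly are hypothetical names, not part of Tika, and the exact-line comparison is an assumption chosen for simplicity.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical post-processing helper; not part of Tika. */
class OcrSplitter {

    /** Non-empty lines of electronic text and of OCR text, kept apart. */
    static final class Result {
        final List<String> text = new ArrayList<>();
        final List<String> ocr = new ArrayList<>();
    }

    /** Splits XHTML character data by whether it sits inside a div with class "ocr". */
    static Result split(String xhtml) {
        Result result = new Result();
        DefaultHandler handler = new DefaultHandler() {
            // One flag per open element: true if that element is <div class="ocr">.
            private final Deque<Boolean> open = new ArrayDeque<>();
            private final StringBuilder buffer = new StringBuilder();
            private int ocrDepth = 0;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                flush(); // pending characters belong to the enclosing element
                boolean isOcr = "div".equals(qName) && "ocr".equals(atts.getValue("class"));
                open.push(isOcr);
                if (isOcr) ocrDepth++;
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                flush();
                if (open.pop()) ocrDepth--;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                buffer.append(ch, start, length);
            }

            private void flush() {
                for (String line : buffer.toString().split("\\R")) {
                    String trimmed = line.trim();
                    if (!trimmed.isEmpty()) {
                        (ocrDepth > 0 ? result.ocr : result.text).add(trimmed);
                    }
                }
                buffer.setLength(0);
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XHTML", e);
        }
        return result;
    }

    /** Returns the OCR lines whose normalized form does not occur in the electronic text. */
    static List<String> ocrOnly(Result r) {
        Set<String> seen = new HashSet<>();
        for (String line : r.text) seen.add(normalize(line));
        List<String> kept = new ArrayList<>();
        for (String line : r.ocr) {
            if (!seen.contains(normalize(line))) kept.add(line);
        }
        return kept;
    }

    /** Lower-cases and collapses whitespace so trivial differences do not hide a match. */
    static String normalize(String s) {
        return s.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
    }
}
```

On the second page above, this would keep the image paragraph and drop the repeated "Start of image" and "End of image" lines from the OCR output. It will miss duplicates when OCR wraps lines differently from the electronic text or misreads a character, so a fuzzier comparison would be needed for real documents.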

On Mon, Jan 4, 2021 at 10:15 AM Peter Kronenberg <pe...@torch.ai>
wrote:

> I’m still having a problem understanding the options on the PDFParser:
> OCRStrategy and ExtractInlineImages. These options appear to be getting
> ignored; I’m not seeing any difference.
>
>
>
> Here’s the code I’m using.  I’ve also attached the file I’m testing with.
> Some of my confusion is similar to what was expressed here:
> http://apache-tika-users.1629097.n2.nabble.com/OCR-Strategy-ocr-only-extracts-also-text-td7574798.html,
> but there was never any resolution.
>
>
>
> When creating a set of parsers by adding to the ParseContext, how do I
> figure out what parser was ultimately used?
>
>
>
> public class TikaOCRParser {
>
>     private static final PDFParserConfig pdfConfig = new PDFParserConfig();
>     private static final TesseractOCRConfig tessConfig = new TesseractOCRConfig();
>
>     private static final AutoDetectParser parser = new AutoDetectParser();
>     private static final ParseContext parseContext = new ParseContext();
>
>     static {
>         parseContext.set(AutoDetectParser.class, parser);
>         parseContext.set(PDFParserConfig.class, pdfConfig);
>         // parseContext.set(TesseractOCRConfig.class, tessConfig);
>     }
>
>     public static String parse(String file) throws TikaException, SAXException, IOException {
>         log.info(String.format("Tesseract path: %s, exists: %s",
>                 tessConfig.getTesseractPath(), new File(tessConfig.getTesseractPath()).exists()));
>         log.info(String.format("Tessdata path: %s, exists: %s",
>                 tessConfig.getTessdataPath(), new File(tessConfig.getTessdataPath()).exists()));
>         log.info(String.format("Image Magick path: %s, exists: %s",
>                 tessConfig.getImageMagickPath(), new File(tessConfig.getImageMagickPath()).exists()));
>         log.info("enableImageProcessing: " + tessConfig.isEnableImageProcessing());
>         pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
>         pdfConfig.setExtractInlineImages(false);
>         log.info("PDF Extract inline images: " + pdfConfig.getExtractInlineImages());
>         log.info("PDF OCR Strategy: " + pdfConfig.getOcrStrategy());
>         log.info("PDF OCR DPI: " + pdfConfig.getOcrDPI());
>         log.info("PDF Detect angles: " + pdfConfig.getDetectAngles());
>
>         ContentHandler handler = new BodyContentHandler(-1);
>         Metadata metadata = new Metadata();
>         try (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(new FileInputStream(file)))) {
>             log.info("calling parse on " + file);
>             parser.parse(stream, handler, metadata, parseContext);
>         }
>         //Arrays.stream(metadata.names()).filter(n -> !metadata.get(n).isEmpty()).forEach(n -> log.info(String.format("%s: %s", n, metadata.get(n))));
>         return handler.toString();
>     }
>
>     public static void main(String[] args) throws TikaException, SAXException, IOException {
>         String file = "c:\\testFiles\\Simple-text-image.docx";
>         System.out.println("Text: " + parse(file));
>     }
> }
>
>
>
> *From:* Peter Kronenberg <pe...@torch.ai>
> *Sent:* Thursday, December 31, 2020 9:59 AM
> *To:* user@tika.apache.org
> *Subject:* {EXTERNAL}OCR on PDFs
>
>
> I’ve got Tika working with Tesseract on PDF files, but it seems that if I
> give it a PDF file that has both searchable text and images, the text is
> OCRed twice.  Is there a way to avoid this?  Even if it has to make two
> passes, one for the straight text and then another for just the images
>

RE: {EXTERNAL}OCR on PDFs

Posted by Peter Kronenberg <pe...@torch.ai>.
I'm still having a problem understanding the options on the PDFParser: OCRStrategy and ExtractInlineImages. These options appear to be getting ignored; I'm not seeing any difference.

Here's the code I'm using.  I've also attached the file I'm testing with.  Some of my confusion is similar to what was expressed here: http://apache-tika-users.1629097.n2.nabble.com/OCR-Strategy-ocr-only-extracts-also-text-td7574798.html, but there was never any resolution.

When creating a set of parsers by adding to the ParseContext, how do I figure out what parser was ultimately used?

public class TikaOCRParser {

    private static final PDFParserConfig pdfConfig = new PDFParserConfig();
    private static final TesseractOCRConfig tessConfig = new TesseractOCRConfig();

    private static final AutoDetectParser parser = new AutoDetectParser();
    private static final ParseContext parseContext = new ParseContext();

    static {
        parseContext.set(AutoDetectParser.class, parser);
        parseContext.set(PDFParserConfig.class, pdfConfig);
        // parseContext.set(TesseractOCRConfig.class, tessConfig);
    }

    public static String parse(String file) throws TikaException, SAXException, IOException {
        log.info(String.format("Tesseract path: %s, exists: %s",
                tessConfig.getTesseractPath(), new File(tessConfig.getTesseractPath()).exists()));
        log.info(String.format("Tessdata path:  %s, exists: %s",
                tessConfig.getTessdataPath(), new File(tessConfig.getTessdataPath()).exists()));
        log.info(String.format("Image Magick path: %s, exists: %s",
                tessConfig.getImageMagickPath(), new File(tessConfig.getImageMagickPath()).exists()));
        log.info("enableImageProcessing: " + tessConfig.isEnableImageProcessing());
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_AND_TEXT_EXTRACTION);
        pdfConfig.setExtractInlineImages(false);
        log.info("PDF Extract inline images: " + pdfConfig.getExtractInlineImages());
        log.info("PDF OCR Strategy: " + pdfConfig.getOcrStrategy());
        log.info("PDF OCR DPI: " + pdfConfig.getOcrDPI());
        log.info("PDF Detect angles: " + pdfConfig.getDetectAngles());

        ContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(new FileInputStream(file)))) {
            log.info("calling parse on " + file);
            parser.parse(stream, handler, metadata, parseContext);
        }
        //Arrays.stream(metadata.names()).filter(n -> !metadata.get(n).isEmpty()).forEach(n -> log.info(String.format("%s: %s", n, metadata.get(n))));
        return handler.toString();
    }

    public static void main(String[] args) throws TikaException, SAXException, IOException {
        String file = "c:\\testFiles\\Simple-text-image.docx";
        System.out.println("Text: " + parse(file));
    }
}

From: Peter Kronenberg <pe...@torch.ai>
Sent: Thursday, December 31, 2020 9:59 AM
To: user@tika.apache.org
Subject: {EXTERNAL}OCR on PDFs

I've got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.  Is there a way to avoid this?  Even if it has to make two passes, one for the straight text and then another for just the images

Re: OCR on PDFs

Posted by Nick Burch <ap...@gagravarr.org>.
On Thu, 31 Dec 2020, Peter Kronenberg wrote:
> I've got Tika working with Tesseract on PDF files, but it seems that if 
> I give it a PDF file that has both searchable text and images, the text 
> is OCRed twice.

Is this a PDF where some other tool has already done the OCR and stored 
the text it found behind the image?

If you highlight the image in Acrobat Reader, does it manage to select 
some text? If you copy and paste do you get text out?

Does this PDF have a mixture of "normal" text and images containing text, 
or is it all just "image text"?


Answers to these will affect how much Tika can help / be configured!

Thanks
Nick

RE: OCR on PDFs

Posted by Peter Kronenberg <pe...@torch.ai>.
Let me play around with the XML handler. That might help me to understand a bit more.

I might have caused some confusion because the code I pasted showed a DOCX, although I was asking about PDF.  I have the same thing as a Word file (in fact, I created it with Word and then saved it to PDF).  So I understand that the PDF options don’t affect DOCX.

Just to clarify, option 1 (extracting the inline images and letting Tesseract run on each image) means that you are *only* running OCR on the images and not on the surrounding text?

Where does the OCR Strategy fit into this?

From: Tim Allison <ta...@apache.org>
Sent: Monday, January 4, 2021 11:11 AM
To: user@tika.apache.org
Subject: Re: OCR on PDFs

Sorry for not responding sooner.  The file that you attached helps me understand this question quite a bit.

The basic answer is: no, not yet, not generally.  The correct way to do OCR on PDFs might be to render the page without rendering the stored text and then run OCR on the page (minus text).  We're not yet doing this.

As mentioned, see our wiki (https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox)) for the two main options of running OCR on PDFs.

On your specific test file (see code and output below) you can use option 1 for PDFs (i.e. extract inline images), and you get what you want, but this will not generalize because some PDFs can use thousands of images per page.


PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, config);

Parser p = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
Path path = Paths.get("/..../Simple-text-image.pdf");
try (InputStream tis = TikaInputStream.get(path, metadata)) {
    p.parse(tis, handler, metadata, parseContext);
}
System.out.println(handler.toString());



<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text
</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut
</p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis nunc sed augue lacus. Et netus
</p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida. Scelerisque fermentum dui
</p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus. Pharetra massa massa ultricies
</p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl suscipit adipiscing. Auctor
</p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed vulputate mi sit amet mauris
</p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget. Rutrum quisque non tellus
</p>
<p>orci ac auctor augue.
</p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate sapien nec sagittis. Vestibulum
</p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Diam ut venenatis
</p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada pellentesque elit eget gravida cum
</p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt lobortis. Dictum varius duis at
</p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada nunc vel risus. Sit amet
</p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc non blandit massa enim nec dui
</p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis at tellus at urna
</p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus pellentesque eu tincidunt tortor
</p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.
</p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae ultricies leo. Gravida neque
</p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod lacinia at quis risus sed.
</p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id semper risus in hendrerit
</p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas diam in. Pharetra sit amet
</p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit gravida rutrum quisque non
</p>
<p>tellus orci.
</p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non odio euismod. Mollis
</p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat nibh. Tristique senectus et
</p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet consectetur adipiscing elit ut
</p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus. Gravida in fermentum et sollicitudin
</p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat maecenas volutpat blandit
</p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare arcu dui. Interdum posuere lorem
</p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus dolor purus non. Ac turpis egestas
</p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
</p>
<p>End of text
</p>
<p>
</p>
<p>  </p>
<p />
</div>
<div class="page"><p />
<p>Start of image
</p>
<p>End of image
</p>
<p> </p>
<p />
<img src="embedded:image0.png" alt="image0.png" /><div class="ocr">Pellentesque adipiscing commodo elit at imperdiet dui. Consectetur purus ut faucibus pulvinar. Tincidunt
praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada bibendum arcu vitae elementum
curabitur vitae. Velit euismod in pellentesque massa placerat duis. Fermentum et sollicitudin ac orci
phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie. Commodo quis imperdiet massa
fincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut porttitor leo a diam sollicitudin

tempor. Amet aliquam id diam maecenas ultricies mi eget mauris pharetra. Ullamcorper dignissim cras
tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis elementum nibh tellus. Id aliquet

lectus proin nibh nis! condimentum. Vitae elementum curabitur vitae nunc sed velit. Rhoncus dolor purus
non enim praesent elementum facilisis leo vel. Velit egestas dui id ornare arcu odio ut sem nulla. Purus
sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.
</div>

</div>
</body></html>

On Thu, Dec 31, 2020 at 9:58 AM Peter Kronenberg <pe...@torch.ai> wrote:
I’ve got Tika working with Tesseract on PDF files, but it seems that if I give it a PDF file that has both searchable text and images, the text is OCRed twice.  Is there a way to avoid this?  Even if it has to make two passes, one for the straight text and then another for just the images

Re: OCR on PDFs

Posted by Tim Allison <ta...@apache.org>.
Sorry for not responding sooner.  The file that you attached helps me
understand this question quite a bit.

The basic answer is: no, not yet, not generally.  The correct way to do OCR
on PDFs might be to render the page without rendering the stored text and
then run OCR on the page (minus text).  We're not yet doing this.

As mentioned, see our wiki
(https://cwiki.apache.org/confluence/display/tika/PDFParser%20(Apache%20PDFBox))
for the two main options of running OCR on PDFs.

On your specific test file (see code and output below) you can use option 1
for PDFs (i.e. extract inline images), and you get what you want, but this
will not generalize because some PDFs can use thousands of images per
page.

PDFParserConfig config = new PDFParserConfig();
config.setExtractInlineImages(true);
ParseContext parseContext = new ParseContext();
parseContext.set(PDFParserConfig.class, config);

Parser p = new AutoDetectParser();
ContentHandler handler = new ToXMLContentHandler();
Metadata metadata = new Metadata();
Path path = Paths.get("/..../Simple-text-image.pdf");
try (InputStream tis = TikaInputStream.get(path, metadata)) {
    p.parse(tis, handler, metadata, parseContext);
}
System.out.println(handler.toString());


<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:PDFVersion" content="1.7" />
<meta name="xmp:CreatorTool" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:hasXFA" content="false" />
<meta name="access_permission:modify_annotations" content="true" />
<meta name="access_permission:can_print_degraded" content="true" />
<meta name="dc:creator" content="Peter Kronenberg" />
<meta name="language" content="en-US" />
<meta name="dcterms:created" content="2020-12-31T20:08:09Z" />
<meta name="Last-Modified" content="2020-12-31T20:08:09Z" />
<meta name="dcterms:modified" content="2020-12-31T20:08:09Z" />
<meta name="dc:format" content="application/pdf; version=1.7" />
<meta name="xmpMM:DocumentID" content="uuid:2C3CDC87-5F9C-49B7-9E4F-0E16A7AE27BC" />
<meta name="Last-Save-Date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:docinfo:creator_tool" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:fill_in_form" content="true" />
<meta name="pdf:docinfo:modified" content="2020-12-31T20:08:09Z" />
<meta name="meta:save-date" content="2020-12-31T20:08:09Z" />
<meta name="pdf:encrypted" content="false" />
<meta name="xmp:CreateDate" content="2020-12-31T15:08:09Z" />
<meta name="modified" content="2020-12-31T20:08:09Z" />
<meta name="Content-Length" content="47113" />
<meta name="pdf:hasMarkedContent" content="true" />
<meta name="Content-Type" content="application/pdf" />
<meta name="xmp:ModifyDate" content="2020-12-31T15:08:09Z" />
<meta name="pdf:docinfo:creator" content="Peter Kronenberg" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.DefaultParser" />
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser" />
<meta name="creator" content="Peter Kronenberg" />
<meta name="dc:language" content="en-US" />
<meta name="meta:author" content="Peter Kronenberg" />
<meta name="pdf:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="meta:creation-date" content="2020-12-31T20:08:09Z" />
<meta name="created" content="2020-12-31T20:08:09Z" />
<meta name="access_permission:extract_for_accessibility" content="true" />
<meta name="access_permission:assemble_document" content="true" />
<meta name="xmpTPg:NPages" content="2" />
<meta name="Creation-Date" content="2020-12-31T20:08:09Z" />
<meta name="resourceName" content="Simple-text-image.pdf" />
<meta name="pdf:hasXMP" content="true" />
<meta name="access_permission:extract_content" content="true" />
<meta name="access_permission:can_print" content="true" />
<meta name="Author" content="Peter Kronenberg" />
<meta name="producer" content="Microsoft® Word for Microsoft 365" />
<meta name="access_permission:can_modify" content="true" />
<meta name="pdf:docinfo:producer" content="Microsoft® Word for Microsoft 365" />
<meta name="pdf:docinfo:created" content="2020-12-31T20:08:09Z" />
<title></title>
</head>
<body><div class="page"><p />
<p>Start of text
</p>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do
eiusmod tempor incididunt ut
</p>
<p>labore et dolore magna aliqua. A arcu cursus vitae congue. Iaculis
nunc sed augue lacus. Et netus
</p>
<p>et malesuada fames. Nunc sed id semper risus in hendrerit gravida.
Scelerisque fermentum dui
</p>
<p>faucibus in ornare quam viverra orci. Dolor morbi non arcu risus.
Pharetra massa massa ultricies
</p>
<p>mi quis. Vitae tempus quam pellentesque nec nam. Sit amet nisl
suscipit adipiscing. Auctor
</p>
<p>augue mauris augue neque gravida in fermentum et sollicitudin. Sed
vulputate mi sit amet mauris
</p>
<p>commodo. Velit sed ullamcorper morbi tincidunt ornare massa eget.
Rutrum quisque non tellus
</p>
<p>orci ac auctor augue.
</p>
<p>Phasellus faucibus scelerisque eleifend donec pretium vulputate
sapien nec sagittis. Vestibulum
</p>
<p>rhoncus est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Diam ut venenatis
</p>
<p>tellus in metus vulputate eu scelerisque felis. Nulla malesuada
pellentesque elit eget gravida cum
</p>
<p>sociis. Est pellentesque elit ullamcorper dignissim cras tincidunt
lobortis. Dictum varius duis at
</p>
<p>consectetur lorem donec massa sapien faucibus. Integer malesuada
nunc vel risus. Sit amet
</p>
<p>consectetur adipiscing elit duis tristique sollicitudin nibh. Nunc
non blandit massa enim nec dui
</p>
<p>nunc mattis enim. Quam viverra orci sagittis eu volutpat odio. Duis
at tellus at urna
</p>
<p>condimentum mattis pellentesque id. Egestas tellus rutrum tellus
pellentesque eu tincidunt tortor
</p>
<p>aliquam nulla. Netus et malesuada fames ac turpis egestas sed tempus urna.
</p>
<p>Ut sem nulla pharetra diam sit amet nisl suscipit. Mus mauris vitae
ultricies leo. Gravida neque
</p>
<p>convallis a cras. Enim nec dui nunc mattis. Non odio euismod
lacinia at quis risus sed.
</p>
<p>Commodo viverra maecenas accumsan lacus vel facilisis. Nunc sed id
semper risus in hendrerit
</p>
<p>gravida rutrum. Mi bibendum neque egestas congue quisque egestas
diam in. Pharetra sit amet
</p>
<p>aliquam id diam maecenas ultricies mi. Semper risus in hendrerit
gravida rutrum quisque non
</p>
<p>tellus orci.
</p>
<p>Bibendum at varius vel pharetra vel. Lacus vestibulum sed arcu non
odio euismod. Mollis
</p>
<p>aliquam ut porttitor leo a diam. Tincidunt praesent semper feugiat
nibh. Tristique senectus et
</p>
<p>netus et malesuada fames ac turpis egestas. Ipsum dolor sit amet
consectetur adipiscing elit ut
</p>
<p>aliquam purus. Sollicitudin ac orci phasellus egestas tellus.
Gravida in fermentum et sollicitudin
</p>
<p>ac orci phasellus egestas. Congue quisque egestas diam in. Volutpat
maecenas volutpat blandit
</p>
<p>aliquam etiam erat. Sed blandit libero volutpat sed cras ornare
arcu dui. Interdum posuere lorem
</p>
<p>ipsum dolor sit. Lectus magna fringilla urna porttitor rhoncus
dolor purus non. Ac turpis egestas
</p>
<p>sed tempus urna. Nam aliquam sem et tortor consequat id porta.
</p>
<p>End of text
</p>
<p>
</p>
<p>  </p>
<p />
</div>
<div class="page"><p />
<p>Start of image
</p>
<p>End of image
</p>
<p> </p>
<p />
<img src="embedded:image0.png" alt="image0.png" /><div
class="ocr">Pellentesque adipiscing commodo elit at imperdiet dui.
Consectetur purus ut faucibus pulvinar. Tincidunt
praesent semper feugiat nibh sed pulvinar. Sagittis aliquam malesuada
bibendum arcu vitae elementum
curabitur vitae. Velit euismod in pellentesque massa placerat duis.
Fermentum et sollicitudin ac orci
phasellus egestas tellus. Ante in nibh mauris cursus mattis molestie.
Commodo quis imperdiet massa
fincidunt nunc pulvinar sapien et ligula. Lorem mollis aliquam ut
porttitor leo a diam sollicitudin

tempor. Amet aliquam id diam maecenas ultricies mi eget mauris
pharetra. Ullamcorper dignissim cras
tincidunt lobortis feugiat vivamus at augue eget. Orci eu lobortis
elementum nibh tellus. Id aliquet

lectus proin nibh nis! condimentum. Vitae elementum curabitur vitae
nunc sed velit. Rhoncus dolor purus
non enim praesent elementum facilisis leo vel. Velit egestas dui id
ornare arcu odio ut sem nulla. Purus
sit amet luctus venenatis lectus magna fringilla. Maecenas sed enim ut sem.
</div>

</div>
</body></html>
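
On the earlier question of which parser was ultimately used: the X-Parsed-By entries in the head of the output above list the parsers that handled the file (here DefaultParser and then PDFParser). In Java code the same values should be available from the Metadata object via metadata.getValues("X-Parsed-By"). For output that has already been serialized, a minimal sketch in plain Java follows. ParsedBy and parsers are hypothetical names, not part of Tika, and the regular expression assumes the attribute order shown in the output above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; not part of Tika. */
class ParsedBy {

    // Matches <meta name="X-Parsed-By" content="..." /> with name before content,
    // tolerating whitespace (including line breaks) between the attributes.
    private static final Pattern META = Pattern.compile(
            "<meta\\s+name=\"X-Parsed-By\"\\s+content=\"([^\"]*)\"\\s*/>");

    /** Returns the X-Parsed-By values in document order. */
    static List<String> parsers(String xhtml) {
        List<String> names = new ArrayList<>();
        Matcher m = META.matcher(xhtml);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }
}
```

For a DOCX input the list would name the Office parser instead of PDFParser, which is a quick way to confirm whether PDFParserConfig settings can apply to a given file at all.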


On Thu, Dec 31, 2020 at 9:58 AM Peter Kronenberg <pe...@torch.ai>
wrote:

> I’ve got Tika working with Tesseract on PDF files, but it seems that if I
> give it a PDF file that has both searchable text and images, the text is
> OCRed twice.  Is there a way to avoid this?  Even if it has to make two
> passes, one for the straight text and then another for just the images
>