
Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/04/12 01:20:56 UTC

RE: Parsing PDF file - setting threshold of unmapped characters

I’ve been thinking about this, and I think it would be a good idea to change the comparison of unmapped characters to a percentage.  For example, you suggested:


unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or something?



The percentage could be configurable.



Another thought I had was to have an AUTO_BEST and an AUTO_FAST.  AUTO_FAST would have a higher threshold of unmapped characters, so that in most cases it would just extract text and not use OCR.  The performance overhead of OCR is very high for not a lot of benefit, given that plain text extraction already gets about 99% of the text.

AUTO_BEST would have a lower threshold before OCR is triggered.



Or just keep AUTO and allow the threshold to be configured, either by number of characters or by percentage.  The only downside is that the user would have to understand it a little more to set the threshold properly, instead of AUTO just working magically.
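
To make the configurable-threshold idea concrete, here is a minimal Java sketch. Everything in it is hypothetical: the `OcrProfile` enum, its field names, and the numbers chosen for AUTO_FAST and AUTO_BEST are made up for illustration and are not part of Tika. It assumes the percentage is expressed as a fraction between 0 and 1, and it keeps the "fewer than 10 characters on the page" check from the current heuristic.

```java
/**
 * Hypothetical threshold profiles for deciding when to OCR a page.
 * The names and numbers are illustrative only, not Tika settings.
 */
enum OcrProfile {
    // Higher thresholds: OCR is triggered rarely, so parsing stays fast.
    AUTO_FAST(100, 0.50),
    // Lower thresholds: OCR is triggered sooner, favoring accuracy.
    AUTO_BEST(10, 0.05);

    final int minUnmappedChars;        // absolute threshold
    final double minUnmappedFraction;  // relative threshold, 0.0 to 1.0

    OcrProfile(int minUnmappedChars, double minUnmappedFraction) {
        this.minUnmappedChars = minUnmappedChars;
        this.minUnmappedFraction = minUnmappedFraction;
    }

    boolean shouldOcr(int totalChars, int unmappedChars) {
        // A nearly empty page is assumed to be an image.
        if (totalChars < 10) {
            return true;
        }
        double fraction = (double) unmappedChars / totalChars;
        // Both the absolute and the relative threshold must be exceeded.
        return unmappedChars > minUnmappedChars && fraction > minUnmappedFraction;
    }
}

final class OcrProfileDemo {
    public static void main(String[] args) {
        // A page with 1579 characters, 109 of them unmapped (about 7%).
        System.out.println("AUTO_FAST: " + OcrProfile.AUTO_FAST.shouldOcr(1579, 109));
        System.out.println("AUTO_BEST: " + OcrProfile.AUTO_BEST.shouldOcr(1579, 109));
    }
}
```

With these made-up numbers, the sample page is left alone by AUTO_FAST and sent to OCR by AUTO_BEST. Exposing the two thresholds directly instead of named profiles would be the "just keep AUTO and make it configurable" variant.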



What do you think?


Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>


From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 1:49 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Parsing PDF file

Yes. You understand perfectly!

I want "auto" to be the best it can be and the most generally applicable across use cases.  Users who want higher performance or better control might parse the PDF first with NO_OCR, and then decide which pages to run OCR on based on the statistics pulled out in the first parse.  Another key statistic in the decision would be the out-of-vocabulary measurement that you can get with an integration with tika-eval.
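
As a rough illustration of the second step of that two-pass approach, the sketch below picks the pages to OCR from the per-page statistics of a first NO_OCR parse. It is plain JDK code with no Tika calls; it assumes the caller has already read the `pdf:charsPerPage` and `pdf:unmappedUnicodeCharsPerPage` values into two string arrays in page order. The class name, method name, and the 0.2 cutoff are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: choose pages to OCR from first-pass statistics. */
final class OcrPageSelector {

    /**
     * Returns the 0-based indices of pages that look like they need OCR.
     * Both arrays are assumed to be in page order and of equal length.
     */
    static List<Integer> pagesNeedingOcr(String[] charsPerPage,
                                         String[] unmappedPerPage,
                                         double maxUnmappedFraction) {
        if (charsPerPage.length != unmappedPerPage.length) {
            throw new IllegalArgumentException("per-page lists differ in length");
        }
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < charsPerPage.length; i++) {
            int total = Integer.parseInt(charsPerPage[i].trim());
            int unmapped = Integer.parseInt(unmappedPerPage[i].trim());
            // Nearly empty page: assume it is an image.
            boolean nearlyEmpty = total < 10;
            // Too large a share of the characters has no Unicode mapping.
            boolean mostlyUnmapped =
                    total > 0 && (double) unmapped / total > maxUnmappedFraction;
            if (nearlyEmpty || mostlyUnmapped) {
                pages.add(i);
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Four sample pages, using counts of the kind reported in this thread.
        String[] chars = {"1579", "282", "606", "948"};
        String[] unmapped = {"109", "90", "239", "614"};
        System.out.println(pagesNeedingOcr(chars, unmapped, 0.2)); // prints [1, 2, 3]
    }
}
```

An out-of-vocabulary score, if available, could be added as a further condition in the same loop.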

So, in short, if there are clear, provable, general improvements to AUTO, we should make them.  If you want more refined control, let us know if the current metadata can be improved to help you develop your application for your use cases.

On Mon, Apr 5, 2021 at 1:06 PM Peter Kronenberg <pe...@torch.ai> wrote:
You’re right that OCRing would result in slightly more accurate results in this case.  But the performance penalty is high.  Wondering if there is some intermediate option.

I think I understand now why you are looking at unmapped characters separately from total characters.  If the total character count is low, we assume the page is an image and OCR it.  But if the unmapped character count is high, the page might still be straight text, yet the unmapped characters will essentially come out as unreadable characters.

From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 11:39 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Parsing PDF file

As for the metadata, we should make the entries unique.  Given that multiple parsers can hit the same file, we still need to record all of them (in this case: default, pdf, tesseract).

As for tweaking the settings... I'm not so sure as I look at the extracted text more.  There are quite a few bad ligatures/unmapped Unicode chars, which would render searches for, e.g., "efficient" or "affairs" useless.

On Mon, Apr 5, 2021 at 10:40 AM Peter Kronenberg <pe...@torch.ai> wrote:
Yes, I think tweaking the criteria for Auto is a good idea.
And if the parser list was a Set, that would automatically eliminate dups
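
A small sketch of the Set idea, in plain Java. It is not how Tika stores `X-TIKA:Parsed-By` today; the `uniqueParsers` helper is hypothetical. A `LinkedHashSet` drops repeats while keeping the order in which each parser was first seen.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: de-duplicate a parsed-by list, keeping first-seen order. */
final class ParsedByDemo {

    static Set<String> uniqueParsers(List<String> parsedBy) {
        // LinkedHashSet ignores repeated entries and preserves insertion order.
        return new LinkedHashSet<>(parsedBy);
    }

    public static void main(String[] args) {
        List<String> parsedBy = List.of(
                "org.apache.tika.parser.CompositeParser",
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.tika.parser.CompositeParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser",
                "org.apache.tika.parser.CompositeParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser");
        // Three distinct parsers remain: CompositeParser, PDFParser, TesseractOCRParser.
        uniqueParsers(parsedBy).forEach(System.out::println);
    }
}
```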

From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 10:15 AM
To: user@tika.apache.org
Subject: Fwd: Parsing PDF file

It looks like the ligatures don't have unicode mappings:

"Division of Monetary A�airs"


if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)

The issue is that this file has > 10 unmapped unicode chars per page.

We could change the heuristic to unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or something?
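
For comparison, here is a sketch of the current check next to the proposed one, in plain Java. It rests on two assumptions that are not settled in this thread: that `percentUnmappedUnicodeChars` is a fraction between 0 and 1 (so 0.2 means 20%), and that the new condition replaces only the unmapped-character half of the test while the `totalCharsPerPage < 10` half stays. The class and method names are hypothetical.

```java
/** Hypothetical side-by-side of the current and proposed OCR triggers. */
final class OcrHeuristics {

    /** The check as quoted above. */
    static boolean currentHeuristic(int totalCharsPerPage, int unmappedCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedCharsPerPage > 10;
    }

    /** The proposed variant: also require a minimum share of unmapped characters. */
    static boolean proposedHeuristic(int totalCharsPerPage, int unmappedCharsPerPage) {
        // Guard against division by zero on empty pages.
        double fractionUnmapped = totalCharsPerPage == 0
                ? 0.0
                : (double) unmappedCharsPerPage / totalCharsPerPage;
        return totalCharsPerPage < 10
                || (unmappedCharsPerPage > 10 && fractionUnmapped > 0.2);
    }

    public static void main(String[] args) {
        // 1579 characters with 109 unmapped is roughly 7% unmapped.
        System.out.println(currentHeuristic(1579, 109));  // prints true
        System.out.println(proposedHeuristic(1579, 109)); // prints false
        // 948 characters with 614 unmapped is roughly 65% unmapped.
        System.out.println(proposedHeuristic(948, 614));  // prints true
    }
}
```

Under these assumptions, a page that is mostly readable text no longer triggers OCR, while a page dominated by unmapped characters still does.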

We should also probably check to see if a parser is in the parsed by list before re-adding it?



0: pdf:charsPerPage : 1579
0: pdf:charsPerPage : 1891
0: pdf:charsPerPage : 2283
0: pdf:charsPerPage : 2224
0: pdf:charsPerPage : 1619
0: pdf:charsPerPage : 2177
0: pdf:charsPerPage : 1626
0: pdf:charsPerPage : 1313
0: pdf:charsPerPage : 1652
0: pdf:charsPerPage : 1493
0: pdf:charsPerPage : 1136
0: pdf:charsPerPage : 1477
0: pdf:charsPerPage : 1264
0: pdf:charsPerPage : 1994
0: pdf:charsPerPage : 2062
0: pdf:charsPerPage : 1756
0: pdf:charsPerPage : 2007
0: pdf:charsPerPage : 2202
0: pdf:charsPerPage : 2105
0: pdf:charsPerPage : 2106
0: pdf:charsPerPage : 1895
0: pdf:charsPerPage : 1978
0: pdf:charsPerPage : 1826
0: pdf:charsPerPage : 1742
0: pdf:charsPerPage : 2073
0: pdf:charsPerPage : 1882
0: pdf:charsPerPage : 1497
0: pdf:charsPerPage : 282
0: pdf:charsPerPage : 606
0: pdf:charsPerPage : 948
0: pdf:charsPerPage : 418
0: pdf:charsPerPage : 266
0: pdf:charsPerPage : 830
0: pdf:charsPerPage : 259
0: pdf:charsPerPage : 716
0: pdf:charsPerPage : 961
0: pdf:charsPerPage : 1325
0: pdf:charsPerPage : 1478
0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical Eye Software
0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
0: pdf:docinfo:title : Inel4shannon.dvi
0: pdf:encrypted : false
0: pdf:hasMarkedContent : false
0: pdf:hasXFA : false
0: pdf:hasXMP : false
0: pdf:producer : Acrobat Distiller 3.01 for Windows
0: pdf:unmappedUnicodeCharsPerPage : 109
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 113
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 94
0: pdf:unmappedUnicodeCharsPerPage : 112
0: pdf:unmappedUnicodeCharsPerPage : 178
0: pdf:unmappedUnicodeCharsPerPage : 74
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 189
0: pdf:unmappedUnicodeCharsPerPage : 165
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 186
0: pdf:unmappedUnicodeCharsPerPage : 162
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 119
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 115
0: pdf:unmappedUnicodeCharsPerPage : 99
0: pdf:unmappedUnicodeCharsPerPage : 107
0: pdf:unmappedUnicodeCharsPerPage : 108
0: pdf:unmappedUnicodeCharsPerPage : 116
0: pdf:unmappedUnicodeCharsPerPage : 174
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 61
0: pdf:unmappedUnicodeCharsPerPage : 90
0: pdf:unmappedUnicodeCharsPerPage : 239
0: pdf:unmappedUnicodeCharsPerPage : 614
0: pdf:unmappedUnicodeCharsPerPage : 216
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 502
0: pdf:unmappedUnicodeCharsPerPage : 103
0: pdf:unmappedUnicodeCharsPerPage : 427
0: pdf:unmappedUnicodeCharsPerPage : 629
0: pdf:unmappedUnicodeCharsPerPage : 347
0: pdf:unmappedUnicodeCharsPerPage : 327

On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <pe...@torch.ai> wrote:
Yes, 2.x

From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 9:54 AM
To: user@tika.apache.org
Subject: Re: Parsing PDF file

Tika 2.x? Looking now.

On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <pe...@torch.ai> wrote:
If I use OCRStrategy=no_ocr, processing is orders of magnitude faster and I don’t see the calls to OCRParser (obviously).  Why is it taking so long with auto?  If the page does not meet the criteria for OCR, then it shouldn’t be calling OCR at all, right?

 "X-TIKA:Parsed-By": "[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser]",


From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, April 5, 2021 8:48 AM
To: user@tika.apache.org
Subject: RE: {EXTERNAL}Parsing PDF file

Correction: I see one instance of PDFParser at the beginning, but why does it then alternate between OCRParser and CompositeParser?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, April 5, 2021 8:41 AM
To: user@tika.apache.org
Subject: {EXTERNAL}Parsing PDF file

Parsing the attached PDF file.   It is a text file, not scanned.  I’m using OCR_Strategy=Auto, extractInlineImages=false

The output contains the following in the metadata.  I’m wondering 2 things.  First, why don’t I see PDFParser?
And second, why does it keep calling the TesseractOCRParser?  Once it determines that it is a PDF file, wouldn’t it stick with that?
I’m asking because it seems to take longer to parse than I would expect, and I’m wondering if the OCRParser is adding extra overhead.


"X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser]


RE: Parsing PDF file - setting threshold of unmapped characters

Posted by Peter Kronenberg <pe...@torch.ai>.

The numbers I suggested were actually from Tim a week or two ago.  Of course, the idea is to allow the user to adjust them, so if the default numbers don't work for a particular scenario, they can be changed.

It sounds like the best solution for the best vs. fast discussion is to just make it an option and let the user decide.

I know that Tesseract has something similar with its best and fast models.



-----Original Message-----
From: Nick Burch <ap...@gagravarr.org> 
Sent: Wednesday, April 14, 2021 10:02 AM
To: user@tika.apache.org
Subject: RE: Parsing PDF file - setting threshold of unmapped characters

On Wed, 14 Apr 2021, Peter Kronenberg wrote:
> Anyone have any thoughts on this?

I think both an absolute and a percentage would be good, but I don't have enough experience to comment on your suggested numbers for those two thresholds, sorry!

Your idea on best vs fast touches on much older discussions on what to do when we have multiple possible parsers available. For example, an external program that's slow but official and very reliable, or a Java library that's quick but misses some edge cases. We never did manage to reach a conclusion on that though...

Nick



RE: Parsing PDF file - setting threshold of unmapped characters

Posted by Nick Burch <ap...@gagravarr.org>.

On Wed, 14 Apr 2021, Peter Kronenberg wrote:
> Anyone have any thoughts on this?

I think both an absolute and a percentage would be good, but I don't have 
enough experience to comment on your suggested numbers for those two 
thresholds, sorry!

Your idea on best vs fast touches on much older discussions on what to do 
when we have multiple possible parsers available. For example, an external 
program that's slow but official and very reliable, or a Java library 
that's quick but misses some edge cases. We never did manage to reach a 
conclusion on that though...

Nick



RE: Parsing PDF file - setting threshold of unmapped characters

Posted by Peter Kronenberg <pe...@torch.ai>.

Anyone have any thoughts on this?


From: Peter Kronenberg
Sent: Sunday, April 11, 2021 9:21 PM
To: user@tika.apache.org; tallison@apache.org
Subject: RE: Parsing PDF file - setting threshold of unmapped characters

I’ve been thinking about this and I think it would be a good idea to change the comparison of unmapped characters to a percentage.  For example, you suggested


unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or something?



The percentage could be configurable.



Another thought I had was to have an AUTO_BEST and an AUTO_FAST.  AUTO_FAST would have a higher threshold of unmapped characters, so that in most cases it would just extract text and not use OCR.  The performance overhead of OCR is very high for not a lot of benefit, given that plain extraction already gets 99% of the text.

AUTO_BEST would have a lower threshold before OCR is triggered.



Or just keep AUTO and allow the threshold to be configured, either by number of characters or by percentage.  The only downside is that the user would have to understand it a little more to be able to set the threshold properly, instead of AUTO just working magically.
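
To make the AUTO_BEST / AUTO_FAST idea concrete, here is a minimal sketch of what such profiles could look like.  Everything in it is an assumption for illustration: the enum, its field names and the threshold values are not part of Tika, percentUnmapped is taken to mean unmapped characters divided by total characters, and the "fewer than 10 characters" check is carried over from the heuristic quoted later in this thread.

```java
/** Hypothetical threshold profiles; these names and numbers do not exist in Tika. */
enum AutoOcrProfile {
    // Lower ratio threshold: OCR is triggered sooner (favours accuracy).
    AUTO_BEST(10, 0.05),
    // Higher ratio threshold: most text pages skip OCR (favours speed).
    AUTO_FAST(10, 0.20);

    final int minUnmappedChars;
    final double minUnmappedRatio;

    AutoOcrProfile(int minUnmappedChars, double minUnmappedRatio) {
        this.minUnmappedChars = minUnmappedChars;
        this.minUnmappedRatio = minUnmappedRatio;
    }

    /** Decides whether a page with the given counts should be sent to OCR. */
    boolean shouldOcr(int totalChars, int unmappedChars) {
        if (totalChars < 10) {
            return true; // nearly empty page, probably an image
        }
        double ratio = (double) unmappedChars / totalChars;
        return unmappedChars > minUnmappedChars && ratio > minUnmappedRatio;
    }
}
```

With these illustrative numbers, a page with 1579 characters of which 109 are unmapped (about 7%) would be OCRed under AUTO_BEST but not under AUTO_FAST.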



What do you think?




From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 1:49 PM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Parsing PDF file

Yes. You understand perfectly!

I want "auto" to be the best it can be and most generally applicable across use cases.  For users who want high performance/better control, you might parse the PDF first with NO_OCR, and then decide which pages to run OCR on based on the statistics pulled out in the first parse.  Another key statistic in the decision would be the out-of-vocabulary measurement that you can get with an integration with tika-eval.

So, in short, if there are clear, provable, general improvements to AUTO, we should make them.  If you want more refined control, let us know if the current metadata can be improved to help you develop your application for your use cases.
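
As a rough sketch of the two-pass approach described above: after a first NO_OCR parse, the per-page values of pdf:charsPerPage and pdf:unmappedUnicodeCharsPerPage could be fed into a small selector that returns the pages worth OCRing.  The class, method and parameter names below are hypothetical, the thresholds are placeholders, and reading the values out of the Tika metadata is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: picks pages to OCR from first-pass (NO_OCR) statistics. */
final class OcrPageSelector {

    /**
     * @param charsPerPage     per-page total character counts from the first parse
     * @param unmappedPerPage  per-page unmapped character counts, same page order
     * @param minChars         pages with fewer characters are treated as images
     * @param maxUnmappedRatio pages above this unmapped/total ratio are OCRed
     * @return zero-based indices of the pages selected for OCR
     */
    static List<Integer> pagesToOcr(int[] charsPerPage, int[] unmappedPerPage,
                                    int minChars, double maxUnmappedRatio) {
        if (charsPerPage.length != unmappedPerPage.length) {
            throw new IllegalArgumentException("page counts differ");
        }
        List<Integer> pages = new ArrayList<>();
        for (int i = 0; i < charsPerPage.length; i++) {
            int total = charsPerPage[i];
            boolean nearlyEmpty = total < minChars;
            boolean mostlyUnmapped =
                    total > 0 && (double) unmappedPerPage[i] / total > maxUnmappedRatio;
            if (nearlyEmpty || mostlyUnmapped) {
                pages.add(i);
            }
        }
        return pages;
    }
}
```

Only the selected pages would then go through a second, OCR-enabled pass, which keeps the expensive step off pages whose extracted text is already usable.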

On Mon, Apr 5, 2021 at 1:06 PM Peter Kronenberg <pe...@torch.ai> wrote:
You’re right that OCRing would give slightly more accurate results in this case, but the performance penalty is high.  I’m wondering if there is some intermediate option.

I think I understand now why you are looking at unmapped characters separately from total characters.  If the total character count is low, we assume the page is an image and run OCR.  But if the unmapped character count is high, the page might still be straight text, and the unmapped characters will essentially come out as unreadable characters.

From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 11:39 AM
To: Peter Kronenberg <pe...@torch.ai>
Cc: user@tika.apache.org
Subject: Re: Parsing PDF file

As for the metadata, we should only add unique entries.  Given that multiple parsers can hit the same file, we need to record all of them (in this case: default, pdf, tesseract).

As for tweaking the settings... I'm less sure now that I've looked at the extracted text more.  There are quite a few bad ligatures/unmapped Unicode chars, which would render searches for, e.g., "efficient" or "affairs" useless.

On Mon, Apr 5, 2021 at 10:40 AM Peter Kronenberg <pe...@torch.ai> wrote:
Yes, I think tweaking the criteria for Auto is a good idea.
And if the parser list were a Set, that would automatically eliminate duplicates.
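
A minimal sketch of the Set idea, assuming the parsed-by values are available as a plain list of class names.  The helper class below is hypothetical and is not how Tika stores X-TIKA:Parsed-By internally; a LinkedHashSet is used so that the order in which parsers were first seen is kept.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper: removes repeated parser names, keeping first-seen order. */
final class ParsedByDedup {

    static List<String> unique(List<String> parsedBy) {
        // LinkedHashSet drops duplicates but preserves insertion order.
        Set<String> seen = new LinkedHashSet<>(parsedBy);
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        List<String> parsedBy = List.of(
                "org.apache.tika.parser.CompositeParser",
                "org.apache.tika.parser.pdf.PDFParser",
                "org.apache.tika.parser.CompositeParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser",
                "org.apache.tika.parser.CompositeParser",
                "org.apache.tika.parser.ocr.TesseractOCRParser");
        // Prints the three distinct parser names in first-seen order.
        System.out.println(unique(parsedBy));
    }
}
```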

From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 10:15 AM
To: user@tika.apache.org
Subject: Fwd: Parsing PDF file

It looks like the ligatures don't have unicode mappings:

"Division of Monetary A�airs"


if (totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10)

The issue is that this file has > 10 unmapped unicode chars per page.

We could change the heuristic to unmappedUnicodeCharsPerPage > 10 && percentUnmappedUnicodeChars > 0.2 or something?
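
To compare the two checks side by side, here is a small sketch.  It assumes that percentUnmappedUnicodeChars means unmapped characters divided by total characters for the page, and that the totalCharsPerPage < 10 test stays in place in the changed version; neither is spelled out in this thread, and the class and method names are made up for illustration.

```java
/** Hypothetical side-by-side of the current check and the proposed one. */
final class OcrHeuristics {

    /** The check quoted above. */
    static boolean current(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        return totalCharsPerPage < 10 || unmappedUnicodeCharsPerPage > 10;
    }

    /** The proposed change: also require a minimum share of unmapped characters. */
    static boolean proposed(int totalCharsPerPage, int unmappedUnicodeCharsPerPage) {
        double percentUnmapped = totalCharsPerPage == 0
                ? 0.0
                : (double) unmappedUnicodeCharsPerPage / totalCharsPerPage;
        return totalCharsPerPage < 10
                || (unmappedUnicodeCharsPerPage > 10 && percentUnmapped > 0.2);
    }

    public static void main(String[] args) {
        // First value of each list in the metadata dump in this message;
        // treating them as belonging to the same page is an assumption.
        System.out.println(current(1579, 109));  // true: 109 > 10
        System.out.println(proposed(1579, 109)); // false: 109 / 1579 is about 0.07
    }
}
```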

We should also probably check to see if a parser is in the parsed by list before re-adding it?



0: pdf:charsPerPage : 1579
0: pdf:charsPerPage : 1891
0: pdf:charsPerPage : 2283
0: pdf:charsPerPage : 2224
0: pdf:charsPerPage : 1619
0: pdf:charsPerPage : 2177
0: pdf:charsPerPage : 1626
0: pdf:charsPerPage : 1313
0: pdf:charsPerPage : 1652
0: pdf:charsPerPage : 1493
0: pdf:charsPerPage : 1136
0: pdf:charsPerPage : 1477
0: pdf:charsPerPage : 1264
0: pdf:charsPerPage : 1994
0: pdf:charsPerPage : 2062
0: pdf:charsPerPage : 1756
0: pdf:charsPerPage : 2007
0: pdf:charsPerPage : 2202
0: pdf:charsPerPage : 2105
0: pdf:charsPerPage : 2106
0: pdf:charsPerPage : 1895
0: pdf:charsPerPage : 1978
0: pdf:charsPerPage : 1826
0: pdf:charsPerPage : 1742
0: pdf:charsPerPage : 2073
0: pdf:charsPerPage : 1882
0: pdf:charsPerPage : 1497
0: pdf:charsPerPage : 282
0: pdf:charsPerPage : 606
0: pdf:charsPerPage : 948
0: pdf:charsPerPage : 418
0: pdf:charsPerPage : 266
0: pdf:charsPerPage : 830
0: pdf:charsPerPage : 259
0: pdf:charsPerPage : 716
0: pdf:charsPerPage : 961
0: pdf:charsPerPage : 1325
0: pdf:charsPerPage : 1478
0: pdf:docinfo:creator_tool : dvips 5.83 (MiKTeX 1.11d) Copyright 1998 Radical Eye Software
0: pdf:docinfo:producer : Acrobat Distiller 3.01 for Windows
0: pdf:docinfo:title : Inel4shannon.dvi
0: pdf:encrypted : false
0: pdf:hasMarkedContent : false
0: pdf:hasXFA : false
0: pdf:hasXMP : false
0: pdf:producer : Acrobat Distiller 3.01 for Windows
0: pdf:unmappedUnicodeCharsPerPage : 109
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 113
0: pdf:unmappedUnicodeCharsPerPage : 120
0: pdf:unmappedUnicodeCharsPerPage : 94
0: pdf:unmappedUnicodeCharsPerPage : 112
0: pdf:unmappedUnicodeCharsPerPage : 178
0: pdf:unmappedUnicodeCharsPerPage : 74
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 189
0: pdf:unmappedUnicodeCharsPerPage : 165
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 132
0: pdf:unmappedUnicodeCharsPerPage : 186
0: pdf:unmappedUnicodeCharsPerPage : 162
0: pdf:unmappedUnicodeCharsPerPage : 145
0: pdf:unmappedUnicodeCharsPerPage : 119
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 115
0: pdf:unmappedUnicodeCharsPerPage : 99
0: pdf:unmappedUnicodeCharsPerPage : 107
0: pdf:unmappedUnicodeCharsPerPage : 108
0: pdf:unmappedUnicodeCharsPerPage : 116
0: pdf:unmappedUnicodeCharsPerPage : 174
0: pdf:unmappedUnicodeCharsPerPage : 138
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 61
0: pdf:unmappedUnicodeCharsPerPage : 90
0: pdf:unmappedUnicodeCharsPerPage : 239
0: pdf:unmappedUnicodeCharsPerPage : 614
0: pdf:unmappedUnicodeCharsPerPage : 216
0: pdf:unmappedUnicodeCharsPerPage : 101
0: pdf:unmappedUnicodeCharsPerPage : 502
0: pdf:unmappedUnicodeCharsPerPage : 103
0: pdf:unmappedUnicodeCharsPerPage : 427
0: pdf:unmappedUnicodeCharsPerPage : 629
0: pdf:unmappedUnicodeCharsPerPage : 347
0: pdf:unmappedUnicodeCharsPerPage : 327

On Mon, Apr 5, 2021 at 10:00 AM Peter Kronenberg <pe...@torch.ai> wrote:
Yes, 2.x

From: Tim Allison <ta...@apache.org>
Sent: Monday, April 5, 2021 9:54 AM
To: user@tika.apache.org
Subject: Re: Parsing PDF file

Tika 2.x? Looking now.

On Mon, Apr 5, 2021 at 8:55 AM Peter Kronenberg <pe...@torch.ai> wrote:
If I use OCRStrategy=no_ocr, processing is orders of magnitude faster and I don’t see the calls to OCRParser (obviously).  Why is it taking so long with auto?  If the page does not meet the criteria for OCR, then it shouldn’t be calling OCR at all, right?

 "X-TIKA:Parsed-By": "[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser]",


From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, April 5, 2021 8:48 AM
To: user@tika.apache.org
Subject: RE: {EXTERNAL}Parsing PDF file

Correction: I see one instance of PDFParser at the beginning, but why does it then alternate between OCRParser and CompositeParser?

From: Peter Kronenberg <pe...@torch.ai>
Sent: Monday, April 5, 2021 8:41 AM
To: user@tika.apache.org
Subject: {EXTERNAL}Parsing PDF file

I’m parsing the attached PDF file.  It is a text PDF, not a scanned one.  I’m using OCR_Strategy=Auto and extractInlineImages=false.

The output contains the following in the metadata.  I’m wondering two things.  First, why don’t I see PDFParser?
Second, why does it keep calling the TesseractOCRParser?  Once it determines that it is a PDF file, wouldn’t it stick with that?
I’m asking because it seems to take longer to parse than I would expect, and I’m wondering if the OCRParser is adding extra overhead.


"X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser]
