Posted to user@tika.apache.org by Peter Kronenberg <pe...@torch.ai> on 2021/04/05 12:41:13 UTC

Parsing PDF file

I'm parsing the attached PDF file. It is a text-based PDF, not a scan. I'm using OCR_Strategy=Auto and extractInlineImages=false.

The output contains the following in the metadata, and I'm wondering two things. First, why don't I see PDFParser?
Second, why does it keep calling the TesseractOCRParser? Once it determines that the file is a PDF, wouldn't it stick with the PDF parser?
I'm asking because parsing takes longer than I would expect, and I'm wondering whether the OCR parser is adding extra overhead.


"X-TIKA:Parsed-By":[org.apache.tika.parser.CompositeParser, org.apache.tika.parser.pdf.PDFParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, 
org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser, org.apache.tika.parser.CompositeParser, org.apache.tika.parser.ocr.TesseractOCRParser]

Peter Kronenberg  |  Senior AI Analytic ENGINEER
C: 703.887.5623
[Torch AI]<http://www.torch.ai/>
4303 W. 119th St., Leawood, KS 66209
WWW.TORCH.AI<http://www.torch.ai/>



RE: Parsing PDF file

Posted by Peter Kronenberg <pe...@torch.ai>.
Correction: I do see one instance of PDFParser at the beginning, but why does it then alternate between TesseractOCRParser and CompositeParser?
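
To put a number on the alternation, here is a minimal sketch that tallies how often each parser class appears in an "X-TIKA:Parsed-By" list. The sample string is a hypothetical, shortened stand-in for the metadata above, and the helper name tally_parsers is made up for illustration. It assumes each entry corresponds to one parser invocation, which nothing in this thread confirms.

```python
from collections import Counter


def tally_parsers(parsed_by):
    """Count how often each parser class name appears in a Parsed-By list."""
    return Counter(name.strip() for name in parsed_by if name.strip())


# Hypothetical, shortened sample in the same shape as the metadata above;
# the real list is much longer.
sample = (
    "org.apache.tika.parser.CompositeParser, "
    "org.apache.tika.parser.pdf.PDFParser, "
    "org.apache.tika.parser.CompositeParser, "
    "org.apache.tika.parser.ocr.TesseractOCRParser, "
    "org.apache.tika.parser.CompositeParser, "
    "org.apache.tika.parser.ocr.TesseractOCRParser"
).split(",")

counts = tally_parsers(sample)

# Print the most frequent parser first.
for name, n in counts.most_common():
    print(f"{n:3d}  {name}")
```

Running the same tally over the full list would show how many TesseractOCRParser entries there are, which could then be compared with the page count or the number of images in the PDF.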
