
Posted to user@tika.apache.org by "Allison, Timothy B." <ta...@mitre.org> on 2017/06/20 15:04:22 UTC

RE: Tesseract - OCR and Tika

Bouncing to user@

Are you able to share the document?

How are you running OCR exactly:
1) running OCR on extracted inline images
2) rendering page and then running OCR on the rendered image

What is the quality of the image?

Are you using the right language pack for the language?

-----Original Message-----
From: Mattmann, Chris A (3010) [mailto:chris.a.mattmann@jpl.nasa.gov] 
Sent: Tuesday, June 20, 2017 10:02 AM
To: dev@tika.apache.org
Cc: Ravi Gadapa <ra...@yahoo.com>
Subject: Re: Tesseract - OCR and Tika

FWD’ing to the Tika list (note TO: address change)


++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development Offices (8212)
NASA Jet Propulsion Laboratory, Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattmann@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS)
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


From: Ravi Gadapa <ra...@yahoo.com>
Date: Monday, June 19, 2017 at 8:56 PM
To: "dev-owner@tika.apache.org" <de...@tika.apache.org>
Subject: Tesseract - OCR and Tika

I have been using Tika with Tesseract OCR for our project, and I seem to have a problem extracting data from PDF documents. Below is a sample of what it extracts.

'EldAJ. iNEIWEI‘IEI ‘IVHG El‘c'l TIVHS SEIHOJJMS TIV "8 'NOILVGNEIWINOOEIEI ElElElﬂiOVdﬂNVW iNEIWdIﬂOEI ElElcl SV 3|in EIWVN S.J_NE|V\ld|ﬂOE| NO GEISVEI EIEI TIVHS HOJJMS iOEINNOOSIG iNEIWdIﬂOEI HO:| EIZIS ElSﬂzl TIV 'Z 'GEliON EISIMEIEIHLO SSEI‘INH ‘EldAJ. EltlﬂSO‘IONEI HS VINEIN NI EIEI TIVHS SEIHOJJMS iOEINNOOSIG HOOGiﬂO TIV 'L


Any suggestions?

Thanks

RE: RE: Tesseract - OCR and Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Hi Ravi,
  I think the problem is that the images are oriented differently.  For whatever reason, Tesseract is not correctly rotating the second TIFF image, image2.tif, even with psm=1 (automatic page segmentation with orientation and script detection).  When I manually extract that image, rotate it and resave it, the OCR is of decent quality.
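
  The manual rotation step could also be scripted. The sketch below is not from the original exchange. It assumes the extracted image is upside down (the angle is not stated here, so 180 degrees is a guess) and uses only the JDK. The class name RotateForOcr and the two file arguments are made up for illustration. Reading TIFF through ImageIO needs Java 9 or later, or an ImageIO TIFF plugin on older JDKs.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class RotateForOcr {

    /** Returns a copy of the image rotated by 180 degrees. */
    static BufferedImage rotate180(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        // TYPE_INT_RGB keeps the sketch independent of the source color model.
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Mirror both axes: pixel (x, y) moves to (w-1-x, h-1-y).
                dst.setRGB(w - 1 - x, h - 1 - y, src.getRGB(x, y));
            }
        }
        return dst;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RotateForOcr <input-image> <output.png>");
            return;
        }
        BufferedImage src = ImageIO.read(new File(args[0]));
        if (src == null) {
            // ImageIO.read returns null when no installed reader understands the file.
            throw new IOException("No image reader available for " + args[0]);
        }
        // PNG is lossless and writable on every JDK, so it is a safe OCR input format.
        ImageIO.write(rotate180(src), "png", new File(args[1]));
    }
}
```

  The rotated PNG can then be given to Tesseract or Tika in place of the original image. For 90 degree rotations the width and height would need to be swapped, which this sketch does not handle.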
  I got decent quality when I used strategy 2 for OCR, which is to render the full page as a single image and then run OCR on that:

<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>
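
One way to try a config file like the one above without touching existing Java code is to hand it to the tika-app command line. The sketch below only assembles that command and runs it on request. It assumes tika-app's --config=<file> and --text options; the file names tika-app.jar, tika-config.xml and input.pdf are placeholders, and the class TikaOcrCommand is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class TikaOcrCommand {

    /** Builds the tika-app command line from caller-supplied paths. */
    static List<String> buildCommand(Path tikaAppJar, Path tikaConfig, Path pdf) {
        return Arrays.asList(
                "java", "-jar", tikaAppJar.toString(),
                // Config option first, then the output mode, then the input file.
                "--config=" + tikaConfig,
                "--text",
                pdf.toString());
    }

    /** Runs the command and returns the exit code of the child process. */
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // extracted text and errors go to this console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder file names; replace them with real paths.
        List<String> command = buildCommand(
                Paths.get("tika-app.jar"),
                Paths.get("tika-config.xml"),
                Paths.get("input.pdf"));
        System.out.println(String.join(" ", command));
        // Only launch the process when explicitly asked to.
        if (args.length > 0 && args[0].equals("--run")) {
            System.exit(run(command));
        }
    }
}
```

From Java, the same strategy can presumably be selected on the PDFParserConfig that is placed in the ParseContext (recent 1.x releases have a setOcrStrategy setter); check the Javadoc for the Tika version in use.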


From: Allison, Timothy B. [mailto:tallison@mitre.org]
Sent: Wednesday, June 21, 2017 3:03 PM
To: 'user@tika.apache.org' <us...@tika.apache.org>
Cc: Ravi Gadapa <ra...@yahoo.com>
Subject: RE: RE: Tesseract - OCR and Tika

Hi Ravi,
  Let’s keep the discussion as public as possible.  I won’t share the document that you sent to my personal email account, of course.
  In the email stream of my life, I missed your follow-up email.  Thank you for the ping and the info.  I’ll take a look shortly.

From: Ravi Gadapa [mailto:ravi_gadapa@yahoo.com]
Sent: Wednesday, June 21, 2017 1:58 PM
To: Allison, Timothy B. <ta...@mitre.org>
Subject: Re: RE: Tesseract - OCR and Tika

Just checking to see if you have any resolution for this.

Thx


Attached is the code I am using, run with the English language pack on the attached file.

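//
// Imports needed by this snippet:
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

            // tesseractbin, tessdataFolder, stream, text and log are defined elsewhere in the class.
            Parser autoDetectParser = new AutoDetectParser();
            // Integer.MAX_VALUE raises the write limit well above the default of 100,000 characters.
            BodyContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
            ParseContext context = new ParseContext();

            // Point Tika at the Tesseract binary and its language data.
            TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
            ocrConfig.setTesseractPath(tesseractbin);
            ocrConfig.setTessdataPath(tessdataFolder);
            // Extract every inline image from the PDF, including repeated ones.
            PDFParserConfig pdfConfig = new PDFParserConfig();
            pdfConfig.setExtractInlineImages(true);
            pdfConfig.setExtractUniqueInlineImagesOnly(false);

            // Registering the parser in the context lets embedded images be parsed (and OCR'd) recursively.
            context.set(Parser.class, autoDetectParser);
            context.set(TesseractOCRConfig.class, ocrConfig);
            context.set(PDFParserConfig.class, pdfConfig);

            // The original START/END messages had a {} placeholder with no argument; it is dropped here.
            log.info("OCR PARSING - START");
            log.info("Tesseract Data path: {} install path: {}", ocrConfig.getTessdataPath(),
                    ocrConfig.getTesseractPath());
            autoDetectParser.parse(stream, handler, new Metadata(), context);
            text = handler.toString();
            log.info("OCR DATA {}", text);
            log.info("OCR PARSING - END");
//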


Thanks





RE: RE: Tesseract - OCR and Tika

Posted by "Allison, Timothy B." <ta...@mitre.org>.

Hi Ravi,
  Let’s keep the discussion as public as possible.  I won’t share the document that you sent to my personal email account, of course.
  In the email stream of my life, I missed your follow-up email.  Thank you for the ping and the info.  I’ll take a look shortly.

From: Ravi Gadapa [mailto:ravi_gadapa@yahoo.com]
Sent: Wednesday, June 21, 2017 1:58 PM
To: Allison, Timothy B. <ta...@mitre.org>
Subject: Re: RE: Tesseract - OCR and Tika

Just checking to see if you have any resolution for this.

Thx
