Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2017/04/04 06:00:19 UTC

Using Tesseract OCR to extract PDF files in EML file attachment

Hi,

Currently, I am able to extract text from scanned PDF images and index it
into Solr using Tesseract OCR, although it is very slow.

However, for EML files with PDF attachments that consist of scanned images,
Tesseract OCR is not able to extract the text from those PDF
attachments.

Can we use the same method for EML files? If not, what would you suggest
for extracting the text from those attachments?

I'm using Solr 6.5.0.

Regards,
Edwin

Re: Using Tesseract OCR to extract PDF files in EML file attachment

Posted by Rick Leir <rl...@leirtech.com>.

Tesseract probably knows nothing of the EML format. Your scripts could pull the EML files apart first.
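
As a rough sketch of what such a script might look like, assuming Python is an option: the standard `email` package can parse an EML message and walk its attachments. The function names `extract_pdf_attachments` and `build_sample_eml`, and the sample message itself, are made up for illustration. The extracted bytes would still have to go through whatever PDF-to-Tesseract step already works for standalone files.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def extract_pdf_attachments(eml_bytes):
    """Return (filename, payload_bytes) for each PDF attachment in an EML message."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    pdfs = []
    # iter_attachments() yields top-level attachments only; it does not
    # recurse into nested (forwarded) messages.
    for part in msg.iter_attachments():
        if part.get_content_type() == "application/pdf":
            name = part.get_filename() or "attachment.pdf"
            pdfs.append((name, part.get_payload(decode=True)))
    return pdfs


def build_sample_eml():
    """Build a small in-memory EML with one fake PDF attachment (illustration only)."""
    msg = EmailMessage()
    msg["Subject"] = "Scanned invoice"
    msg["From"] = "sender@example.com"
    msg["To"] = "archive@example.com"
    msg.set_content("See attached scan.")
    msg.add_attachment(b"%PDF-1.4 fake scanned content",
                       maintype="application", subtype="pdf",
                       filename="scan.pdf")
    return msg.as_bytes()


if __name__ == "__main__":
    for name, data in extract_pdf_attachments(build_sample_eml()):
        print(name, len(data), "bytes")
```

Each returned payload can be written to a temporary file and processed like any other scanned PDF.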



Re: Using Tesseract OCR to extract PDF files in EML file attachment

Posted by Charlie Hull <ch...@flax.co.uk>.

My colleagues Eric Pugh and Dan Worley covered OCR and Solr in a
presentation at our recent London Lucene/Solr Meetup:
https://www.meetup.com/Apache-Lucene-Solr-London-User-Group/events/264579498/
(direct link to the slides, in case you can't find it in the comments:
https://www.slideshare.net/o19s/payloads-and-ocr-with-solr)

HTH

Charlie


On 14/10/2019 11:40, Retro wrote:
> Hello, thanks for answer, but let me explain the setup. We are running our
> own backup solution for emails (messages from Exchange in MSG format).
> Content of these messages then indexed in SOLR. But SOLR can not process
> attachments within those MSG files, can not OCR them. This is what I need -
> to OCR attachments and get their content indexed in SOLR.
>
> Davis, Daniel (NIH/NLM) [C] wrote
>> Nuance and ABBYY provide OCR capabilities as well.
>> Looking at higher level solutions, both indexengines.com and Comvault can
>> do email remediation for legal issues.
>>> AJ Weber wrote
>>>> There are alternative, paid, libraries to parse and extract attachments
>>>> from EML files as well
>>>> EML attachments will have a mimetype associated with their metadata.
>>> Sent from: https://lucene.472066.n3.nabble.com/Solr-User-f472068.html


-- 
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk

RE: Using Tesseract OCR to extract PDF files in EML file attachment

Posted by Retro <ho...@mail.ru>.

Hello, thanks for the answer, but let me explain the setup. We are running our
own backup solution for emails (messages from Exchange in MSG format).
The content of these messages is then indexed in Solr. But Solr cannot process
the attachments within those MSG files and cannot OCR them. This is what I
need: to OCR the attachments and get their content indexed in Solr.

Davis, Daniel (NIH/NLM) [C] wrote
> Nuance and ABBYY provide OCR capabilities as well.
> Looking at higher-level solutions, both indexengines.com and Commvault can
> do email remediation for legal issues.
>> AJ Weber wrote
>> > There are alternative, paid, libraries to parse and extract attachments
>> > from EML files as well
>> > EML attachments will have a mimetype associated with their metadata.

RE: Using Tesseract OCR to extract PDF files in EML file attachment

Posted by "Davis, Daniel (NIH/NLM) [C]" <da...@nih.gov>.

Nuance and ABBYY provide OCR capabilities as well.

Looking at higher-level solutions, both indexengines.com and Commvault can do email remediation for legal issues.

> -----Original Message-----
> From: Retro <ho...@mail.ru>
> Sent: Friday, October 11, 2019 8:06 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Using Tesseract OCR to extract PDF files in EML file attachment
> 
> AJ Weber wrote
> > There are alternative, paid, libraries to parse and extract attachments
> > from EML files as well
> > EML attachments will have a mimetype associated with their metadata.
> 
> Hello, can you give a hint what are those commercial libraries that would do
> the job? We need to index MSG files and OCR attachments within MSG.
> Tesseract can not do this, and I'm having hard time to find the solution.
> Thank you!

Re: Using Tesseract OCR to extract PDF files in EML file attachment

Posted by Retro <ho...@mail.ru>.

AJ Weber wrote
> There are alternative, paid, libraries to parse and extract attachments 
> from EML files as well
> EML attachments will have a mimetype associated with their metadata.

Hello, can you give a hint as to which commercial libraries would do
the job? We need to index MSG files and OCR the attachments within them.
Tesseract cannot do this, and I'm having a hard time finding a solution.
Thank you!

Re: Using Tesseract OCR to extract PDF files in EML file attachment

Posted by AJ Weber <aw...@comcast.net>.

You'll need to use something like JavaMail (javax.mail), or one of the
libraries that have been built on top of it for higher-level access, to
open the EML files and extract the attachments. You can then operate on
the extracted attachments as you would on any other file.
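
For the "as you would any file" step, here is a hedged sketch in Python of handing an already extracted attachment to Solr's extracting request handler (`/update/extract`), which is presumably the route standalone scanned PDFs already take. The host, the core name `mail`, the document id, and the helper name `build_extract_request` are assumptions for illustration. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_extract_request(solr_base, core, doc_id, filename, payload, content_type):
    """Build (but do not send) a POST to Solr's extracting request handler."""
    params = urlencode({
        "literal.id": doc_id,        # unique key for the resulting Solr document
        "resource.name": filename,   # file name hint for content-type detection
        "commit": "true",
    })
    url = f"{solr_base}/{core}/update/extract?{params}"
    return Request(url, data=payload, method="POST",
                   headers={"Content-Type": content_type})


if __name__ == "__main__":
    # Hypothetical values: adjust host, core, and id scheme to your setup.
    req = build_extract_request("http://localhost:8983/solr", "mail",
                                "mail-42-scan.pdf", "scan.pdf",
                                b"%PDF-1.4 fake scanned content",
                                "application/pdf")
    print(req.get_method(), req.full_url)
```

Sending it would be a matter of passing `req` to `urllib.request.urlopen` against a running Solr instance.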

There are alternative, paid, libraries to parse and extract attachments 
from EML files as well.

Each EML attachment has a MIME type associated with it in its metadata.
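
To illustrate how that MIME type could be used, here is a small Python sketch that lists each attachment with its declared type and flags the ones that are candidates for OCR. The rule in `needs_ocr` (PDFs and images) and all function names are assumptions for illustration. Note that the MIME type alone cannot tell a scanned PDF from one that already has a text layer.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def needs_ocr(content_type):
    """Assumed rule: PDFs and images are OCR candidates, everything else is not."""
    return content_type == "application/pdf" or content_type.startswith("image/")


def classify_attachments(eml_bytes):
    """Return (filename, mime_type, ocr_candidate) for each top-level attachment."""
    msg = message_from_bytes(eml_bytes, policy=policy.default)
    result = []
    for part in msg.iter_attachments():
        mime_type = part.get_content_type()
        result.append((part.get_filename(), mime_type, needs_ocr(mime_type)))
    return result


def build_mixed_eml():
    """Build an in-memory EML with three fake attachments (illustration only)."""
    msg = EmailMessage()
    msg["Subject"] = "Mixed attachments"
    msg.set_content("Three files attached.")
    msg.add_attachment(b"%PDF-1.4 fake", maintype="application",
                       subtype="pdf", filename="scan.pdf")
    msg.add_attachment(b"\x89PNG fake", maintype="image",
                       subtype="png", filename="photo.png")
    msg.add_attachment("a,b\n1,2\n", subtype="csv", filename="data.csv")
    return msg.as_bytes()


if __name__ == "__main__":
    for name, mime_type, ocr in classify_attachments(build_mixed_eml()):
        print(name, mime_type, "OCR candidate" if ocr else "no OCR")
```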
