Posted to solr-user@lucene.apache.org by Chandan Tamrakar <ch...@nepasoft.com> on 2014/05/06 07:15:33 UTC
Indexing scanned PDFs
We are using Solr to index PDF documents, but in some cases the PDFs
are scanned documents with no text to extract and index.
Is there a plugin or module in Solr that we can integrate so that it would
actually extract the text via OCR and then index it?
Thanks in advance
Chandan Tamrakar
Re: Indexing scanned PDFs
Posted by Jack Krupansky <ja...@basetechnology.com>.
Also, be aware that there are a lot of PDF files that have text which is the
result of a low-accuracy OCR scan of the page images in the PDF file.
High-accuracy OCR scan is rather expensive. You can usually tell if you have
a "scanned" PDF by zooming way in - a PDF file generated directly from a
word processor source file will retain smooth curves on characters while a
PDF generated from scanned page images will show heavy pixelation.
-- Jack Krupansky
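
The visual zoom check above is manual. For a batch of files, a rough automated stand-in is to look at how much text extraction actually returned. The sketch below is only a heuristic under stated assumptions: the function name, the page count argument, and the 50-characters-per-page threshold are all hypothetical choices, and the extracted text is assumed to come from whatever extractor is already in use (for example Tika).

```python
def looks_scanned(extracted_text: str, page_count: int,
                  min_chars_per_page: int = 50) -> bool:
    """Guess whether a PDF is image-only, based on how little text was extracted.

    The threshold is an assumed value for illustration; tune it on real data.
    """
    if page_count <= 0:
        raise ValueError("page_count must be positive")
    # Count only letters and digits, so whitespace, form feeds, and stray
    # punctuation left behind by extraction do not count as real text.
    useful_chars = sum(1 for ch in extracted_text if ch.isalnum())
    return useful_chars / page_count < min_chars_per_page


if __name__ == "__main__":
    print(looks_scanned("", page_count=3))
    print(looks_scanned("word " * 200, page_count=2))
```

Note that this only catches PDFs with little or no text layer. A PDF that already carries a low-accuracy OCR text layer, as described above, would pass this check even though its text may be poor.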
Re: Indexing scanned PDFs
Posted by Alexandre Rafalovitch <ar...@gmail.com>.
Nothing I am aware of for Solr directly. You may have better luck
chasing this on the Tika mailing list, as Tika is what Solr uses under
the covers to index PDFs. Doing a quick search for Tika and OCR
brings up a number of links.
Regards,
Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency
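
For context on the Tika path mentioned above: Solr exposes Tika through its extracting request handler (Solr Cell), which is conventionally mapped to /update/extract in solrconfig.xml. The sketch below only builds the request URL. The host, core name, and document id are hypothetical, and the handler path is an assumption about the local configuration; literal.id and commit are standard request parameters for that handler.

```python
from urllib.parse import quote, urlencode


def build_extract_url(solr_base: str, core: str, doc_id: str,
                      commit: bool = True) -> str:
    """Build a URL for Solr's extracting request handler (Solr Cell).

    Assumes the handler is mapped at /update/extract, which depends on
    the solrconfig.xml in use.
    """
    # literal.id sets the unique key of the document being indexed.
    params = {"literal.id": doc_id}
    if commit:
        params["commit"] = "true"
    return f"{solr_base.rstrip('/')}/{quote(core)}/update/extract?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host, core, and id, for illustration only.
    print(build_extract_url("http://localhost:8983/solr", "collection1", "doc1"))
```

The PDF itself would be sent as the body of a POST to this URL. For an image-only PDF, this path would be expected to yield little or no body text, which is why an OCR step is needed before or inside the Tika extraction.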