Posted to solr-user@lucene.apache.org by Karan Jain <sa...@gmail.com> on 2020/02/11 18:14:47 UTC

Support Tesseract in Apache Solr

Hi All,

Solr 7.6.0 is running on my local machine. I installed Tesseract with the
following steps:
yum install tesseract
echo export PATH=$PATH:/usr/share/tesseract >>~/.bash_profile
echo export TESSDATA_PREFIX=/usr/share/tesseract >>~/.bash_profile

Now the deployed Solr supports Tesseract. I searched for TESSDATA_PREFIX
in https://github.com/apache/lucene-solr and found no reference to it. I
do not understand how Solr came to know about the installed Tesseract.
Please point me to the specific Java class in Solr, if possible.

Thanks for your time,
Best,
Karan

Re: Support Tesseract in Apache Solr

Posted by Edward Ribeiro <ed...@gmail.com>.

I second Jörn: don't deploy Tesseract + Tika on the same server as Solr.
Tika, especially with Tesseract OCR enabled, will drain machine resources
that could otherwise be used for indexing and searching. In addition, any
malformed PDF could potentially bring down the Solr server. Your best bet is
to run tika-server + Tesseract on a dedicated server/container, use it to
extract the text (including OCR) from the documents, and then send it to Solr.
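
A minimal sketch of that split, assuming a tika-server instance at
http://localhost:9998 and a Solr collection named "docs" with a dynamic
*_txt field (all of these are assumptions for illustration, not details from
this thread). It sends a file to tika-server's /tika endpoint and posts the
returned text to Solr's JSON update handler:

```python
import json
import urllib.request

# Assumed addresses; adjust to wherever tika-server and Solr actually run.
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def extract_text(path, tika_url=TIKA_URL):
    """PUT the raw file to tika-server and return the extracted plain text."""
    with open(path, "rb") as fh:
        payload = fh.read()
    req = urllib.request.Request(
        tika_url,
        data=payload,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")


def build_update_request(doc_id, text, solr_url=SOLR_UPDATE_URL):
    """Build (but do not send) a Solr JSON update request for one document.

    The field name content_txt is hypothetical; it relies on a *_txt
    dynamic field being defined in the collection's schema.
    """
    body = json.dumps([{"id": doc_id, "content_txt": text}]).encode("utf-8")
    return urllib.request.Request(
        solr_url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def index_file(path):
    """Extract text on the Tika side, then index it in Solr."""
    text = extract_text(path)
    with urllib.request.urlopen(build_update_request(path, text)) as resp:
        return resp.status
```

Calling index_file("scan.pdf") from the extraction host keeps the OCR load
off the Solr machine; Solr only ever receives plain text.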

But to answer your question: Solr embeds Tika, which can be configured to use
Tesseract. It is Tika that knows about Tesseract. See
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR for more
information.
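
For illustration only: as far as I know, the class involved is
org.apache.tika.parser.ocr.TesseractOCRParser in Tika, not anything in the
Solr source tree. It checks whether a tesseract executable can be run, and
TESSDATA_PREFIX is read by the tesseract program itself, which would explain
why the variable does not show up in lucene-solr. The Python sketch below
imitates that kind of PATH probe; it is not Tika's actual code, and the
function names are made up.

```python
import os
import shutil


def find_tesseract(search_path=None):
    """Return the full path of an executable named 'tesseract', or None.

    search_path is a PATH-style string; None means use the process PATH.
    """
    return shutil.which("tesseract", path=search_path)


def ocr_environment(search_path=None):
    """Summarize what a wrapper process could learn from its environment."""
    return {
        "tesseract": find_tesseract(search_path),
        # Not interpreted here; the tesseract binary reads it on its own.
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),
    }
```

If the probe finds nothing, a wrapper like this would simply skip OCR, which
matches the behavior you saw: installing Tesseract and putting it on the PATH
was enough for it to be picked up.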

Best regards,
Edward

On Tue, Feb 11, 2020 at 3:26 PM Jörn Franke <jo...@gmail.com> wrote:

> Honestly, I would not run Tesseract on the same server as Solr. It takes a
> lot of resources and may negatively impact Solr. Just write a small program
> using Tika + Tesseract that runs on a different server/container and posts
> the results to Solr.
>
> About your question: probably Tika (a dependency of Solr) figured it out,
> or, depending on your format, PDFBox (used by Tika).

Re: Support Tesseract in Apache Solr

Posted by Jörn Franke <jo...@gmail.com>.

Honestly, I would not run Tesseract on the same server as Solr. It takes a lot of resources and may negatively impact Solr. Just write a small program using Tika + Tesseract that runs on a different server/container and posts the results to Solr.

About your question: probably Tika (a dependency of Solr) figured it out, or, depending on your format, PDFBox (used by Tika).
