Posted to solr-user@lucene.apache.org by Zheng Lin Edwin Yeo <ed...@gmail.com> on 2018/04/10 02:23:18 UTC

Text in images are not extracted and indexed to content

Hi,

Currently I am facing an issue where text in image files such as JPG and BMP
is not being extracted and indexed. After indexing, Tika does extract all the
metadata and index it under the attr_* fields. However, the content field is
always empty for image files. For other document types such as .doc, the
content is extracted correctly.

I have already updated tika-parsers-1.17.jar, setting extractInlineImages to
true under \prg\apache\tika\parser\pdf\.


What could be the reason?

I have just upgraded to Solr 7.3.0.

Regards,
Edwin

Re: Text in images are not extracted and indexed to content

Posted by Zheng Lin Edwin Yeo <ed...@gmail.com>.

Thanks for the reply.

The cause was Tesseract OCR. I had installed the new Tesseract 4 on my
system, and its installer does not set the path in the Environment
Variables, unlike the older Tesseract 3 installer, which set the path
automatically during installation.
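
For anyone hitting the same symptom, here is a minimal sketch of a check that
can be run before re-indexing. It assumes the OCR step locates the binary by
the name "tesseract" through the PATH environment variable; the helper names
find_tesseract and tesseract_version are made up for illustration.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the binary if it is reachable via PATH, else None."""
    return shutil.which(executable)


def tesseract_version(path):
    """Return the first line of the version banner printed by the binary."""
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some builds print the banner to stderr rather than stdout, so check both.
    lines = (result.stdout or result.stderr).strip().splitlines()
    return lines[0] if lines else ""


if __name__ == "__main__":
    location = find_tesseract()
    if location is None:
        print("tesseract is not on PATH; add its install directory and restart Solr")
    else:
        print(f"found {location}: {tesseract_version(location)}")
```

If the check reports that the binary is missing, adding the Tesseract install
directory to the path and restarting the process that runs Tika should make
it visible again.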

Regards,
Edwin

On 10 April 2018 at 18:58, Shamik Sinha <sh...@gmail.com> wrote:

> To index text in images, the image needs to be searchable, i.e. the text
> needs to be overlaid on the image as in a searchable PDF. You can do this
> using OCR, but it is somewhat unreliable if the images are scanned copies of
> written text.

Re: Text in images are not extracted and indexed to content

Posted by Shamik Sinha <sh...@gmail.com>.

To index text in images, the image needs to be searchable, i.e. the text
needs to be overlaid on the image as in a searchable PDF. You can do this
using OCR, but it is somewhat unreliable if the images are scanned copies of
written text.

On 10-Apr-2018 4:12 PM, "Rahul Singh" <ra...@gmail.com> wrote:

You may need to extract the text outside Solr and index plain text through an
external ingestion process. That gives you much more control over Tika's
attributes and behavior.

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation

Re: Text in images are not extracted and indexed to content

Posted by Rahul Singh <ra...@gmail.com>.

You may need to extract the text outside Solr and index plain text through an external ingestion process. That gives you much more control over Tika's attributes and behavior.
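
A rough sketch of what such an external ingestion step could look like,
calling the Tesseract command line directly instead of going through Tika.
Everything specific here is an assumption for illustration: the Solr base URL,
the collection name, the use of the file path as the id, and the helper names.
Only the content field name comes from the question.

```python
import json
import subprocess
import urllib.request


def ocr_command(image_path, executable="tesseract"):
    """Build the CLI call; the output base 'stdout' makes Tesseract print the text."""
    return [executable, image_path, "stdout"]


def extract_text(image_path):
    """Run Tesseract on one image and return the recognised text."""
    result = subprocess.run(
        ocr_command(image_path), capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def build_update_request(solr_base, collection, docs):
    """Prepare, without sending, a JSON update request for Solr's /update handler."""
    url = f"{solr_base.rstrip('/')}/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_image(image_path, solr_base, collection):
    """OCR one image and post the text to Solr; returns the HTTP status code."""
    # Assumption: the schema has an "id" key field and a "content" text field.
    doc = {"id": image_path, "content": extract_text(image_path)}
    request = build_update_request(solr_base, collection, [doc])
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical values; adjust to the local setup.
    print(index_image("scan.jpg", "http://localhost:8983/solr", "mycollection"))
```

Keeping the OCR call and the request construction in separate functions makes
it easy to swap in a different extractor or to batch several documents into
one update.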

--
Rahul Singh
rahul.singh@anant.us

Anant Corporation
