Posted to solr-user@lucene.apache.org by Miguel Fernandes <mi...@gmail.com> on 2019/05/03 15:07:08 UTC

Help extracting text from PDF images when indexing files

Hi all,

I'm new to Solr. I recently downloaded Solr 8.0.0 and have been
following the tutorials. Using the 2 example instances created there, I'm
trying to create my own collection. I made a copy of the _default configset
and used it to create my collection.

In my case, the files I want to index are PDF files composed of images. I
have Tesseract installed and I can parse the PDF files correctly using a
Tika server instance I downloaded, i.e. I can get the extracted text from
the images.

I'm following the instructions on the page "Uploading Data with Solr Cell
Using Apache Tika" to properly configure the PDF image extraction, but I
can't get it to work. My aim is for the content of the PDF file to go into
a field named content that I've created in my schema. In my attempts, this
field is either absent or, when present, doesn't contain the expected text
from the parsed images.
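
For illustration, a field like this could be declared in the schema roughly
as shown below. This is only a sketch: the type text_general (which ships
with the _default configset) and the indexed/stored/multiValued values are
assumptions, not copied from the actual schema.

```xml
<!-- Hypothetical declaration of the target field; attribute values are assumed. -->
<!-- stored="true" is what lets the extracted text be returned in query results. -->
<field name="content" type="text_general" indexed="true" stored="true" multiValued="true"/>
```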

For the ExtractingRequestHandler configuration, the lib clauses are
present in my solrconfig.xml, and the request handler section is as follows:

  <requestHandler name="/update/extract"
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="fmap.content">content</str>
    </lst>
    <str name="parseContext.config">parseContext.xml</str>
  </requestHandler>

And my parseContext.xml file is:

<?xml version="1.0" encoding="UTF-8" ?>
<entries>
    <entry class="org.apache.tika.parser.pdf.PDFParserConfig"
           impl="org.apache.tika.parser.pdf.PDFParserConfig">
        <property name="extractInlineImages" value="true" />
    </entry>
</entries>

Any help on how to correctly extract the text from the PDF images would be
great.
Thanks
Miguel