Posted to by Phillip Wu <> on 2020/07/08 00:01:53 UTC

solr query to return matched text to regex with default schema

I want to search Solr for server names in a set of Microsoft Word documents, PDFs, and image files such as JPG and GIF.
Server names are given by a regular expression (regex).

I want to get the text in the documents that matches the regex, e.g. INFPWSV01, PLCPLDB01.
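
To make the goal concrete, here is a minimal sketch of the kind of extraction wanted, run on plain text outside Solr. The pattern is hypothetical: seven uppercase letters followed by two digits, inferred only from the two sample names above, since the actual regex is not shown. The function name find_server_names is likewise just an illustration.

```python
import re
from typing import List

# Hypothetical pattern (an assumption, not the real regex):
# seven uppercase letters followed by two digits, as in INFPWSV01.
SERVER_NAME = re.compile(r"\b[A-Z]{7}\d{2}\b")


def find_server_names(text: str) -> List[str]:
    """Return every substring of text that matches SERVER_NAME, in order."""
    return SERVER_NAME.findall(text)


sample = "Backups run on INFPWSV01 and replicate to PLCPLDB01 nightly."
print(find_server_names(sample))
```

The same function could be applied to the document text returned by Solr, provided that text is actually stored and returned in the response.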

I've indexed the files using Solr/Tika/Tesseract with the default schema.

I've used the highlighting options in the search tool:
hl ticked
hl.usePhraseHighlighter ticked

Solr only returns the metadata (presumably), such as the filename, for the file containing the pattern(s).
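
For reference, the ticked options correspond to request parameters on the select handler. The sketch below only builds such a request URL; it does not contact Solr. It assumes the extracted content is indexed in the _text_ field, uses the host and core name from the curl command in this post, and uses a hypothetical lowercase term pattern, on the assumption that text_general lowercases indexed terms and that a regex query is matched against those terms. Highlighting can only return snippets from a stored field, which would explain why no matched text comes back while _text_ has stored="false".

```python
from urllib.parse import urlencode


def build_highlight_query(core: str, field: str, term_regex: str) -> str:
    """Build a select URL that runs a regex query on field with highlighting on."""
    params = {
        # Lucene regex query syntax: field:/pattern/
        "q": f"{field}:/{term_regex}/",
        "hl": "true",                       # the "hl" checkbox
        "hl.fl": field,                     # field to take snippets from
        "hl.usePhraseHighlighter": "true",  # the second checkbox
    }
    return f"http://localhost:8983/solr/{core}/select?" + urlencode(params)


# Hypothetical pattern: seven lowercase letters followed by two digits.
url = build_highlight_query("gettingstarted", "_text_", "[a-z]{7}[0-9]{2}")
print(url)
```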

1. Would I have to modify the managed schema?
2. If so, would I have to store the file content in the schema?
3. If so, is this the way to do it:
a. solrconfig.xml <- inside my "core"
<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
b. Remove the line
<str name="fmap.meta">ignored_</str>
as I want the metadata
c. Change this in the managed schema
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
by setting stored to "true":
curl -X POST -H 'Content-type:application/json' --data-binary '{
     "replace-field":{ "name":"_text_", "type":"text_general", "multiValued":true, "indexed":true, "stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema