Posted to solr-user@lucene.apache.org by Phillip Wu <ph...@unsw.edu.au> on 2020/07/08 00:01:53 UTC

solr query to return matched text to regex with default schema

Hi,
I want to search Solr for server names in a set of Microsoft Word documents, PDFs, and image files such as JPG and GIF.
Server names are given by the regular expressions (regex):
INFP[a-zA-Z0-9]{3,9}
TRKP[a-zA-Z0-9]{3,9}
PLCP[a-zA-Z0-9]{3,9}
SQRP[a-zA-Z0-9]{3,9}
....
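
For reference, the patterns above can be combined into one client-side regex. This is only a minimal sketch in Python, not a Solr query: it assumes just the four prefixes listed here (the full list is elided), and the names `PREFIXES`, `SERVER_NAME` and `find_server_names` are made up for illustration.

```python
import re

# Assumption: only the four prefixes named in the post; the real list is longer.
PREFIXES = ("INFP", "TRKP", "PLCP", "SQRP")

# Uses [a-zA-Z0-9]; the range A-z would also match [ \ ] ^ _ and `.
# \b keeps the match from starting or ending in the middle of a longer word.
SERVER_NAME = re.compile(r"\b(?:%s)[a-zA-Z0-9]{3,9}\b" % "|".join(PREFIXES))


def find_server_names(text):
    """Return every server name found in text, in order of appearance."""
    return SERVER_NAME.findall(text)
```

For example, `find_server_names("INFPWSV01 talks to PLCPLDB01")` would pick out the two names from the example in the Problem section.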

Problem
=======
I want to get the text in the documents that matches the regex, e.g. INFPWSV01, PLCPLDB01.

I've indexed the files using Solr/Tika/Tesseract with the default schema.

I've used the highlight search tool
hl ticked
hl.usePhraseHighlighter ticked

Solr returns only what is presumably metadata, such as the filename, for the files containing the pattern(s), not the matched text itself.

Questions
=========
1. Would I have to modify the managed schema?
2. If so, would I have to save the file content in the schema?
3. If so, is this the way to do it?
a. solrconfig.xml <- inside my "core"
<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
...
b. Remove the line
<str name="fmap.meta">ignored_</str>
as I want the metadata
c. Change this field in the managed schema
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
by setting stored to "true":
curl -X POST -H 'Content-type:application/json' --data-binary '{
  "replace-field":{
     "name":"_text_",
     "type":"text_general",
     "multiValued":true,
     "indexed":true,
     "stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema
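
If `_text_` does end up stored, the matching text could then be pulled out of the query response on the client side. The following is a rough sketch only: it assumes the usual JSON select response layout (`response` -> `docs`), that each document has an `id` field, and only the four prefixes listed earlier; `server_names_by_doc` is a hypothetical helper, not part of Solr.

```python
import re

# Assumption: only the four prefixes named in the post.
SERVER_NAME = re.compile(r"\b(?:INFP|TRKP|PLCP|SQRP)[a-zA-Z0-9]{3,9}\b")


def server_names_by_doc(solr_response, field="_text_"):
    """Map each document id to the server names found in its stored text."""
    results = {}
    for doc in solr_response.get("response", {}).get("docs", []):
        values = doc.get(field, [])
        if isinstance(values, str):  # tolerate a single-valued field
            values = [values]
        names = []
        for value in values:
            names.extend(SERVER_NAME.findall(value))
        if names:  # skip documents with no stored text or no matches
            results[doc.get("id")] = names
    return results
```

The input would be the parsed JSON body of a select request that includes `_text_` in its field list; documents without the field, or without a match, are simply left out of the result.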