Posted to solr-user@lucene.apache.org by Phillip Wu <ph...@unsw.edu.au> on 2020/07/08 00:01:53 UTC
solr query to return matched text to regex with default schema
Hi,
I want to search Solr for server names in a set of Microsoft Word documents, PDFs, and image files such as JPG and GIF.
Server names are given by the regular expressions (regex):
INFP[a-zA-Z0-9]{3,9}
TRKP[a-zA-Z0-9]{3,9}
PLCP[a-zA-Z0-9]{3,9}
SQRP[a-zA-Z0-9]{3,9}
....
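For reference, a minimal sketch of how these patterns behave on plain text, assuming Python's re module. It covers only the four prefixes listed above (the elided ones are not guessed), and the \b word boundaries and the helper name find_server_names are illustrative assumptions. Note that the character class uses A-Z; the range A-z would also admit punctuation such as [ and _.

```python
import re

# Only the four prefixes listed in the post; the elided patterns are not guessed.
PREFIXES = ("INFP", "TRKP", "PLCP", "SQRP")

# The \b anchors are an assumption: they stop matches inside longer words.
SERVER_NAME = re.compile(r"\b(?:%s)[a-zA-Z0-9]{3,9}\b" % "|".join(PREFIXES))

def find_server_names(text):
    """Return every server-name match in text, in order of appearance."""
    return SERVER_NAME.findall(text)

sample = "Backups run on INFPWSV01 and PLCPLDB01; see page 3."
print(find_server_names(sample))  # ['INFPWSV01', 'PLCPLDB01']
```

This only checks the regex itself; it says nothing about how Solr tokenizes or stores the extracted text.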
Problem
=======
I want to get the text in the documents that matches the regex, e.g. INFPWSV01, PLCPLDB01.
I've indexed the files using Solr/Tika/Tesseract with the default schema.
I've used the highlight search tool with these options:
hl ticked
hl.usePhraseHighlighter ticked
Solr only returns the metadata (presumably), such as the filename, for the files containing the pattern(s).
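The same highlight request can be sketched as a URL, which makes the parameters explicit. This is only an illustration under several assumptions: the core is the "gettingstarted" core used later in this post, the field to highlight is _text_ (passed via hl.fl), and the helper name build_highlight_query is hypothetical. Two caveats as far as I understand Solr: highlighting needs the field to be stored, and regex queries match indexed terms, which text_general lowercases, hence the lowercase pattern in the usage line.

```python
from urllib.parse import urlencode

# Assumed endpoint: the "gettingstarted" core on a local Solr instance.
SELECT_URL = "http://localhost:8983/solr/gettingstarted/select"

def build_highlight_query(regex, field="_text_"):
    """Build a select URL that runs a regex query with highlighting enabled."""
    params = {
        "q": f"{field}:/{regex}/",          # Solr regex queries are wrapped in slashes
        "hl": "true",                       # same as ticking "hl"
        "hl.usePhraseHighlighter": "true",  # same as ticking hl.usePhraseHighlighter
        "hl.fl": field,                     # assumption: highlight the content field
    }
    return SELECT_URL + "?" + urlencode(params)

# Lowercase pattern because text_general lowercases terms at index time.
print(build_highlight_query("infp[a-z0-9]{3,9}"))
```

The function only builds the URL; it does not contact Solr, so the response still depends on how the field is defined in the schema.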
Questions
=========
1. Would I have to modify the managed schema?
2. If so, would I have to store the file content in the schema?
3. If so, is this the way to do it?
a. solrconfig.xml <- inside my "core"
<requestHandler class="solr.extraction.ExtractingRequestHandler" name="/update/extract" startup="lazy">
<lst name="defaults">
<str name="lowernames">true</str>
<str name="fmap.meta">ignored_</str>
<str name="fmap.content">_text_</str>
</lst>
...
b. Remove the line
<str name="fmap.meta">ignored_</str>
as I want the metadata
c. Change this field in the managed schema
<field name="_text_" type="text_general" multiValued="true" indexed="true" stored="false"/>
so that stored is "true":
curl -X POST -H 'Content-type:application/json' --data-binary '{
"replace-field":{
"name":"_text_",
"type":"text_general",
"multiValued":true,
"indexed":true,
"stored":true }
}' http://localhost:8983/api/cores/gettingstarted/schema