Posted to solr-user@lucene.apache.org by Fiz Ahmed <fi...@gmail.com> on 2018/01/19 18:56:42 UTC
Issue with solr.HTMLStripCharFilterFactory
Hi Solr Experts,
I am using HTMLStripCharFilterFactory to remove HTML tags from the body
field.
The body contains data like <html><body>Ipad</body></html>
I made the following changes in the managed schema:
<field name="body" type="html" indexed="true" required="false" stored="true"/>
<copyField source="body" dest="_text_"/>
---
<fieldType name="html" stored="true" indexed="true" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0"
            catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
I restarted Solr and reindexed.
But when I query in the Solr Admin UI, the search results still contain
HTML tags:
"body":"<body>Practically everytime I log onto Mogran, suddenly I see it
running
*Please let me know what the issue might be. Am I missing anything?*
Thanks
Fiz..
Re: Issue with solr.HTMLStripCharFilterFactory
Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/19/2018 11:56 AM, Fiz Ahmed wrote:
> But When I Query in Solr Admin.. I am still getting the Search results with
> Html Tags in it.
Search results will always contain the original content that was sent to
Solr for a stored field. Analysis is only applied to the indexed terms
and/or to queries, never to stored data.
This is how Solr and Lucene have *always* worked. It's not new behavior.
To achieve what you want, you will either need to use an update
processor, or you'll need to adjust your indexing program to make the
changes before it sends the data to Solr.
If you choose the update processor route, there is a built-in processor
that has the same behavior as the HTML filter you are using. Note that
if you use that update processor, you won't need the html filter in the
analyzer for the affected fields, because the HTML will be gone before
the analysis runs.
https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
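As a rough illustration of the update processor route, a chain along these lines could be added to solrconfig.xml. This is only a sketch: the chain name "strip-html" is made up for the example, the field name "body" is taken from the schema in the question, and the exact setup should be checked against the documentation for your Solr version.

```xml
<!-- solrconfig.xml (sketch): the chain name "strip-html" is a hypothetical choice -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strips HTML markup from the incoming value of "body" before it is
       stored or analyzed. By default this processor matches no fields,
       so a selector such as fieldName is needed. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">body</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory must come last so the document is actually indexed -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain only takes effect if update requests select it, for example by adding update.chain=strip-html to the update request parameters or by marking the chain with default="true". Documents that are already indexed keep their old stored values until they are reindexed.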
You can always write a custom processor if you wish. A custom processor
might be required if you want your stored data to undergo some very
extensive transformation.
Here's the documentation on update processors:
https://lucene.apache.org/solr/guide/6_6/update-request-processors.html
Thanks,
Shawn