Posted to solr-user@lucene.apache.org by Fiz Ahmed <fi...@gmail.com> on 2018/01/19 18:56:42 UTC

Issue with solr.HTMLStripCharFilterFactory

Hi Solr Experts,

I am using HTMLStripCharFilterFactory to remove HTML tags from the body
field.

Body contains data like <html><body>Ipad</body></html>

I made the following changes in the managed schema.

<field name="body" type="html" indexed="true" required="false" stored="true"/>


<copyField source="body" dest="_text_"/>


---


<fieldType name="html" stored="true" indexed="true" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>


I restarted Solr and reindexed.


But when I query in the Solr Admin UI, I still get search results with
HTML tags in them.



"body":"<body>Practically everytime I log onto Mogran, suddenly I see it
running


*Please let me know what the issue might be. Am I missing anything?*


Thanks

Fiz..

Re: Issue with solr.HTMLStripCharFilterFactory

Posted by Shawn Heisey <ap...@elyograg.org>.
On 1/19/2018 11:56 AM, Fiz Ahmed wrote:
> But when I query in the Solr Admin UI, I still get search results with
> HTML tags in them.

Search results will always contain the original content that was sent to
Solr, exactly as it was stored. Analysis only applies to the indexed terms
and/or queries, not to stored data.

This is how Solr and Lucene have *always* worked.  It's not new behavior.

To achieve what you want, you will either need to use an update 
processor, or you'll need to adjust your indexing program to make the 
changes before it sends the data to Solr.

If you choose the update processor route, there is a built-in processor 
that has the same behavior as the HTML filter you are using.  Note that 
if you use that update processor, you won't need the html filter in the 
analyzer for the affected fields, because the HTML will be gone before 
the analysis runs.

https://lucene.apache.org/solr/6_6_0//solr-core/org/apache/solr/update/processor/HTMLStripFieldUpdateProcessorFactory.html
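
As a rough illustration of the update processor route, a chain along the following lines could go in solrconfig.xml. This is only a sketch under assumptions: the chain name "strip-html" is made up for this example, and the "body" field name is taken from the schema in the original message. Adjust both to your setup.

```xml
<!-- Hypothetical chain name; "body" is the field from the original post. -->
<updateRequestProcessorChain name="strip-html">
  <!-- Strips HTML markup from the field value before it is stored or analyzed. -->
  <processor class="solr.HTMLStripFieldUpdateProcessorFactory">
    <str name="fieldName">body</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <!-- RunUpdateProcessorFactory should come last; it performs the actual update. -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

A chain defined this way is only used when it is selected, for example by passing update.chain=strip-html on the update request or by marking the chain with default="true". Existing documents would need to be reindexed through the chain for their stored values to change.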

You can always write a custom processor if you wish.  A custom processor 
might be required if you want your stored data to undergo some very 
extensive transformation.

Here's the documentation on update processors:

https://lucene.apache.org/solr/guide/6_6/update-request-processors.html

Thanks,
Shawn