Posted to solr-user@lucene.apache.org by Daniel von der Helm <D....@neumueller.com> on 2017/08/10 08:13:24 UTC
How to remove Scripts and Styles in content of SOLR Indexes [content field] while indexed through URL?
Hi,
If a fetched HTML page (using SimplePostTool with -Ddata=web) contains <script> and <style> tags inside the <body> tag (not in the <head> tag), the inner text of these tags (i.e. ECMAScript/JS code and CSS styles) remains part of the document text in the "content"/"_text_" field of the indexed documents.
So when I search in _text_ for "push(arguments)", for example, I get a result :(
Any idea how to remove this unwanted content?
Using: Solr 6.6.0.
solrconfig.xml:
<requestHandler name="/update/extract"
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>
    <str name="captureAttr">true</str>
    <str name="fmap.meta">ignored_</str>
    <str name="fmap.content">plaintext</str>
  </lst>
</requestHandler>
Thanks in advance
Daniel
Re: How to remove Scripts and Styles in content of SOLR Indexes [content field] while indexed through URL?
Posted by Steve Rowe <sa...@gmail.com>.
Hi Daniel,
HTMLStripCharFilterFactory in your index analyzer should do the trick: <https://lucene.apache.org/solr/guide/6_6/charfilterfactories.html#CharFilterFactories-solr.HTMLStripCharFilterFactory>
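A minimal sketch of such an index analyzer might look like the following. The field type name text_html_stripped is hypothetical, and the tokenizer and lowercase filter are only an assumed, typical chain; the field that receives the extracted content would have to use this type. The char filter is documented to drop the contents of <script> and <style> elements along with the tags themselves, but it can only do so if the markup is still present in the field value when it reaches the analyzer.

```xml
<!-- Hypothetical field type; the name and the tokenizer/filter chain are assumptions -->
<fieldType name="text_html_stripped" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Strips HTML markup, including the contents of <script> and <style> elements -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- Query text is not HTML, so no char filter is applied here -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Char filters only change what gets tokenized and indexed; the stored value of the field is left as it was submitted.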
--
Steve
www.lucidworks.com