Posted to dev@lucene.apache.org by "Steve Rowe (JIRA)" <ji...@apache.org> on 2014/09/11 22:25:34 UTC
[jira] [Comment Edited] (LUCENE-5943) HTML strip filter removes text between < and >
[ https://issues.apache.org/jira/browse/LUCENE-5943?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14130648#comment-14130648 ]
Steve Rowe edited comment on LUCENE-5943 at 9/11/14 8:25 PM:
-------------------------------------------------------------
Angle brackets indicate tags in HTML, and are stripped by HTMLStripCharFilter. That's one of its main functions. If you don't want tags stripped out, don't use HTMLStripCharFilter.
was (Author: steve_rowe):
Angle brackets indicate tags in HTML, and are stripped by HTMLStripCharFilter. That's one of it's main functions. If you don't want tags stripped out, don't use HTMLStripCharFilter.
> HTML strip filter removes text between < and >
> ----------------------------------------------
>
> Key: LUCENE-5943
> URL: https://issues.apache.org/jira/browse/LUCENE-5943
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Environment: Production
> Reporter: suleman mubarik
>
> Given the input “I love <pizza hut> so much”,
> when I apply the HTML strip filter it removes “pizza hut” and I get the tokens "i", "love", "so", "much"
> with these offsets: ((0,1), (2,6), (20,22), (23,27)).
> The HTML strip filter should return "i", "love", "pizza", "hut", "so", "much".
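
The behavior described in the comment can be illustrated with a small, self-contained sketch. This is not Lucene's HTMLStripCharFilter and does not use any Lucene API; the class TagStripSketch and its methods blankTags and tokenize are hypothetical names introduced only for illustration. It assumes that everything from a '<' to the next '>' counts as a tag, and it overwrites such spans with spaces so the surviving tokens keep their positions in the original string. The offsets it prints are computed for the plain ASCII string in the code, so they need not match the ones quoted in the report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only: a simplified stand-in for tag stripping,
// not the actual HTMLStripCharFilter implementation.
class TagStripSketch {

    // Overwrite every "<...>" span with spaces so the remaining text keeps
    // its original character positions. A '<' with no later '>' is left
    // untouched (a simplification made for this sketch).
    static String blankTags(String input) {
        StringBuilder out = new StringBuilder(input);
        int open = input.indexOf('<');
        while (open >= 0) {
            int close = input.indexOf('>', open);
            if (close < 0) {
                break;
            }
            for (int i = open; i <= close; i++) {
                out.setCharAt(i, ' ');
            }
            open = input.indexOf('<', close + 1);
        }
        return out.toString();
    }

    // Split on whitespace, lowercase each term, and render it as
    // "term(start,end)" with an exclusive end offset.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isWhitespace(text.charAt(i))) {
                i++;
                continue;
            }
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            String term = text.substring(start, i).toLowerCase(Locale.ROOT);
            tokens.add(term + "(" + start + "," + i + ")");
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "I love <pizza hut> so much";
        // "<pizza hut>" is treated as a tag, so neither word becomes a token.
        System.out.println(tokenize(blankTags(input)));
    }
}
```

Under these assumptions "pizza" and "hut" never reach the tokenizer, because the whole bracketed span is treated as markup. That matches the explanation given in the comment: text inside angle brackets is only preserved if the input is not passed through a tag-stripping step.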