Posted to dev@lucene.apache.org by "Erick Erickson (JIRA)" <ji...@apache.org> on 2013/11/30 14:24:38 UTC
[jira] [Resolved] (SOLR-678) HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html tags correctly.
[ https://issues.apache.org/jira/browse/SOLR-678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Erick Erickson resolved SOLR-678.
---------------------------------
Resolution: Won't Fix
2013 Old JIRA cleanup
> HTMLStripStandardTokenizerFactory doesn't interpret word boundaries on html tags correctly.
> -------------------------------------------------------------------------------------------
>
> Key: SOLR-678
> URL: https://issues.apache.org/jira/browse/SOLR-678
> Project: Solr
> Issue Type: Bug
> Components: search
> Affects Versions: 1.2
> Environment: Mac OS X 10.5.4, java version "1.5.0_13"
> Reporter: Matt Connolly
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> The HTMLStripStandardTokenizerFactory filter does not place word boundaries at HTML tags as it should.
> For example, indexing the text "<h2>title</h2><p>some comment</p>" results in two words being indexed, "titlesome" and "comment", when there should be three: "title", "some", and "comment".
> Not all tags need this. For example, it may be perfectly reasonable for "<b>sub</b>script" to be indexed as "subscript", since <b> is interpreted as an inline element, not a block element.
> I would suggest that all block or paragraph tags be translated into spaces so that the text on either side of the tag is treated as separate tokens, e.g. p div h1 h2 h3 h4 h5 h6 br hr pre (etc.)
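
The behavior suggested in the report above can be illustrated with a minimal sketch. This is not Solr's implementation; the names BLOCK_TAGS, strip_html, and tokenize are hypothetical, the tag set is just the list from the report (the "etc." is left open), and whitespace splitting stands in for a real tokenizer. Block tags become a space, while all other tags are removed without adding one.

```python
import re

# Hypothetical tag set: only the block-level tags named in the report.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "br", "hr", "pre"}

# Matches an opening, closing, or self-closing tag and captures the tag name.
TAG_RE = re.compile(r"</?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_html(text):
    """Replace block tags with a space and drop every other tag outright."""
    def replace(match):
        return " " if match.group(1).lower() in BLOCK_TAGS else ""
    return TAG_RE.sub(replace, text)


def tokenize(text):
    """Whitespace tokenizer standing in for the real one (an assumption)."""
    return strip_html(text).split()


print(tokenize("<h2>title</h2><p>some comment</p>"))
print(tokenize("<b>sub</b>script"))
```

With this rule the first input yields three separate tokens, and the inline <b> example stays a single token, which matches the two cases described in the report.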
--
This message was sent by Atlassian JIRA
(v6.1#6144)