Posted to solr-dev@lucene.apache.org by "J.J. Larrea (JIRA)" <ji...@apache.org> on 2007/10/05 06:46:50 UTC
[jira] Issue Comment Edited: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532570 ] 

skeptikos edited comment on SOLR-42 at 10/4/07 9:45 PM:
----------------------------------------------------------

Here is the workaround I am using, along with a long comment explaining why:

{code:title=solrconfig.xml}
		<!--
			Special-case stuff for HTML tags in Abstract field:
			Originally we had
				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
			but pre-stripping destroys offsets needed for highlighting.
			Tried an HTML tag-extraction RegEx as a post-process
				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
				<filter class="solr.PatternReplaceFilterFactory"
					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
					replacement=""
					replace="all"/>
			but that still doesn't adjust the offsets, and the subsequent WordDelimiterFilter (WDF) then created havoc.
			One solution is to split on whitespace or tag delimiters (making tags
			into text), and either index the tags or use StopFilter to remove 'em.
			But the chosen solution is to treat as a single delimiter either an entire chain
			of tags plus any whitespace which surrounds or separates them (leaving non-HTML
			< and > intact), or else a run of whitespace as normal.

		-->
			<tokenizer class="solr.PatternTokenizerFactory"
					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
			<filter class="solr.ISOLatin1AccentFilterFactory"/>
			...
{code}

Without the XML encoding, the RegEx is:

			{{(?:\s*</?\w+((\s+\w+(\s*=\s*(?:".*?"|'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+}}

and it treats runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces": both are token delimiters and get swallowed, so the preceding and following tokens keep offsets that are correct relative to the original text.
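
To see the offsets this delimiter pattern produces, here is a minimal sketch using plain java.util.regex rather than Solr's PatternTokenizer. The class name TagDelimiterDemo and its tokenize() helper are hypothetical and only approximate what a split-on-delimiter tokenizer would do; the sample title is the one from the issue description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr.
class TagDelimiterDemo {
    // The delimiter pattern from the comment, XML-decoded and escaped for a Java string.
    static final Pattern DELIM = Pattern.compile(
        "(?:\\s*</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>\\s*)++|\\s+");

    // Splits text on DELIM and reports each token as "term@start-end",
    // where start/end are offsets into the ORIGINAL (unstripped) text.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = DELIM.matcher(text);
        int pos = 0;
        while (m.find()) {
            if (m.start() > pos) {
                tokens.add(text.substring(pos, m.start()) + "@" + pos + "-" + m.start());
            }
            pos = m.end();
        }
        if (pos < text.length()) {
            tokens.add(text.substring(pos) + "@" + pos + "-" + text.length());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
            + "fabrics in a polyorogenic terrane of NW Iberia";
        for (String token : tokenize(title)) {
            System.out.println(token);
        }
    }
}
```

Because the tags are consumed as delimiters instead of being removed beforehand, each reported offset can be used directly to slice the original string, which is what the highlighter needs.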

  
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> When searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
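
The 22-character shift described in the issue can be reproduced outside Solr. The following is a minimal sketch that assumes a naive tag-stripping regex as a stand-in for HTMLStripReader (it is not Solr's implementation); the class StripOffsetDemo and its methods are hypothetical names chosen for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Solr.
class StripOffsetDemo {
    // Naive tag matcher standing in for HTMLStripReader (an assumption).
    static final Pattern TAG = Pattern.compile("</?\\w+[^>]*>");

    // Offset of term as seen by a tokenizer that runs AFTER tags were stripped.
    static int offsetAfterStripping(String text, String term) {
        return TAG.matcher(text).replaceAll("").indexOf(term);
    }

    // Offset of term in the stored, unstripped field value.
    static int offsetInOriginal(String text, String term) {
        return text.indexOf(term);
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic "
            + "fabrics in a polyorogenic terrane of NW Iberia";
        int stripped = offsetAfterStripping(title, "fabrics");
        int original = offsetInOriginal(title, "fabrics");
        System.out.println("offset after stripping: " + stripped);
        System.out.println("offset in original:     " + original);
        System.out.println("shift:                  " + (original - stripped));
        // Applying the stripped offsets to the original text selects the wrong span.
        System.out.println("wrongly highlighted:    '"
            + title.substring(stripped, stripped + "fabrics".length()) + "'");
    }
}
```

The shift equals the combined length of the four tags (5 + 6 + 5 + 6), which is why a highlighter that applies post-stripping offsets to the stored text marks a span to the left of the matching term.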
