Posted to solr-dev@lucene.apache.org by "Mert Sakarya (JIRA)" <ji...@apache.org> on 2009/06/09 16:28:08 UTC

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717688#action_12717688 ] 

Mert Sakarya commented on SOLR-42:
----------------------------------

I think this is a problem with Microsoft Word. No one can claim that

      ...valid html...<?xml:namespace prefix = o />...valid html...

is valid HTML. Any HTML parser should expect a "?>" after a "<?".

However, as a workaround, I changed line 644 of HTMLStripReader.java as follows:

      //if (ch=='?' && peek()=='>') {
      if ((ch=='?' || ch=='/') && peek()=='>') { //This fixes Office Word problem, but might cause other problems!!! Be very careful.

I then built my own copy of HTMLStripReader with this change and packaged it in a separate jar file.
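
To show the effect of that one-line change in isolation, here is a minimal sketch. It is not the real HTMLStripReader code: the class PiSkipSketch, the method skipProcessingInstruction, and the relaxed flag are names made up for illustration, and the scan works on a String rather than a Reader. It only demonstrates that accepting "/>" as well as "?>" lets the parser get past the Word fragment.

```java
class PiSkipSketch {
    /**
     * Hypothetical helper, not part of Solr. Given the index of a "<?",
     * returns the index just past the terminator, or -1 if none is found.
     * In strict mode only "?>" ends the instruction; in relaxed mode
     * "/>" is accepted too, mirroring the modified condition above.
     */
    public static int skipProcessingInstruction(String s, int start, boolean relaxed) {
        for (int i = start + 2; i + 1 < s.length(); i++) {
            char ch = s.charAt(i);
            boolean terminator = ch == '?' || (relaxed && ch == '/');
            if (terminator && s.charAt(i + 1) == '>') {
                return i + 2;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String word = "<p>a</p><?xml:namespace prefix = o /><p>b</p>";
        int start = word.indexOf("<?");

        // Strict: there is no "?>" anywhere, so the rest of the input is swallowed.
        System.out.println(skipProcessingInstruction(word, start, false));

        // Relaxed: the instruction ends at "/>", and the remaining HTML survives.
        int end = skipProcessingInstruction(word, start, true);
        System.out.println(word.substring(end));
    }
}
```

The same sketch also shows why the warning in the comment matters: with relaxed set to true, an input such as "<?php echo '<br/>'; ?>" would be cut off at the "/>" inside the string literal rather than at the real "?>".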

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> When searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place: 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 
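
The offset drift described in the issue can be reproduced with a rough sketch like the one below. It is not Solr's implementation: OffsetMapSketch, Stripped, and strip are hypothetical names, and the stripper naively drops everything between '<' and '>'. The point is only that recording, for each kept character, its position in the original input is enough to translate a token offset back to the unstripped text.

```java
import java.util.Arrays;

class OffsetMapSketch {
    /** Stripped text plus, for each stripped character, its index in the original input. */
    static final class Stripped {
        public final String text;
        public final int[] originalOffsets;

        Stripped(String text, int[] originalOffsets) {
            this.text = text;
            this.originalOffsets = originalOffsets;
        }
    }

    /** Naive tag removal (an assumption for illustration): drops anything between '<' and '>'. */
    public static Stripped strip(String html) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char ch = html.charAt(i);
            if (ch == '<') {
                inTag = true;
            } else if (ch == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                map[out.length()] = i; // remember where this character came from
                out.append(ch);
            }
        }
        return new Stripped(out.toString(), Arrays.copyOf(map, out.length()));
    }

    public static void main(String[] args) {
        String title = "<SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of "
                + "mylonitic fabrics in a polyorogenic terrane of NW Iberia";
        Stripped s = strip(title);

        int strippedStart = s.text.indexOf("fabrics");        // offset a tokenizer would report
        int originalStart = s.originalOffsets[strippedStart]; // offset the highlighter needs

        System.out.println(originalStart - strippedStart); // prints 22
        System.out.println(title.substring(originalStart, originalStart + "fabrics".length())); // prints fabrics
    }
}
```

The difference of 22 is the combined length of the two <SUP> and two </SUP> tags, which matches the shift reported in the issue.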
