You are viewing a plain text version of this content. The canonical link for it is here.

Posted to solr-dev@lucene.apache.org by "Andrew May (JIRA)" <ji...@apache.org> on 2006/07/28 22:44:13 UTC

[jira] Created: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Highlighting problems with HTMLStripWhitespaceTokenizerFactory
--------------------------------------------------------------

Key: SOLR-42
URL: http://issues.apache.org/jira/browse/SOLR-42
Project: Solr
Issue Type: Bug
Components: update
Reporter: Andrew May

Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).

Example title field:

40Ar/39Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia

Searching for title:fabrics with highlighting on, the highlighted version has the tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).

Response from Yonik on the solr-user mailing-list:

HTMLStripWhitespaceTokenizerFactory works in two phases...
HTMLStripReader removes the HTML and passes the result to
WhitespaceTokenizer... at that point, Tokens are generated, but the
offsets will correspond to the text after HTML removal, not before.

I did it this way so that HTMLStripReader could go before any
tokenizer (like StandardTokenizer).

Can you open a JIRA bug for this? The fix would be a special version
of HTMLStripReader integrated with a WhitespaceTokenizer to keep
offsets correct.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "David Smiley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848542#action_12848542 ] 

David Smiley commented on SOLR-42:
----------------------------------

I was just reading this whole thread and SOLR-1394.  It seems this issue can be closed as a duplicate of SOLR-1394.  No?

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Stuart Sierra (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12542184 ] 

Stuart Sierra commented on SOLR-42:
-----------------------------------

Another workaround is simply to strip HTML tags and entities before sending the document to Solr.  This gives correct highlights.  But if you want to retrieve the original HTML document you'll need to stash it somewhere else.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556928#action_12556928 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

Yet another problem: 
In certain circumstances, it is possible that restoreState() cannot be invoked b/c the mark has been lost due to moving well beyond it.  This is most noticeable in the while (true) loop inside of readProcessingInstruction() and can be caused by the following test:
{code}
public void testQuestionMark() throws Exception {
    StringBuilder testBuilder = new StringBuilder(5020);
    testBuilder.append("ah<?> ");
    for (int i = 0; i < 5000; i++){
      testBuilder.append('a');//tack on enough to go beyond the mark readahead limit, since <?> makes HTMLStripReader think it is a processing instruction
    }
    String test = testBuilder.toString();
    Reader reader = new HTMLStripReader(new BufferedReader(new StringReader(test)));//force the use of BufferedReader
    int ch = 0;
    StringBuilder builder = new StringBuilder();
    try {
      while ((ch = reader.read()) != -1){
        builder.append((char)ch);
      }
    } finally {
      System.out.println("String: " + builder.toString());
    }
    assertTrue(builder.toString() + " is not equal to " + test, builder.toString().equals(test) == true);
  }
{code}

In this case, the final assert never gets hit, because there is an IOException in reader.read of:
{noformat}
java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(BufferedReader.java:485)
	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:158)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:731)
	at org.apache.solr.analysis.HTMLStripReaderTest.testQuestionMark(HTMLStripReaderTest.java:171)
{noformat}

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Attachment: SOLR-42.patch

Patch that replaces HTML cruft with whitespace, thus preserving highlighting at the expense of extra characters.  Also included the a test file and test case

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556748#action_12556748 ] 

Hoss Man commented on SOLR-42:
------------------------------

Hmmm.... I was assuming this would be an option on both the HTMLStripReader and the Tokenizers that use it (the tokenizers taking the option only to pass it on to the Reader) but i see what you mean ... once the Tokenizer knows the character positions of the "words" coming out of the text, it can then strip out those characters (Hmmm... strip the characters that are a placeholder for when other characters where already striped ... why do i feel like we're going to go to hell for this?)

Tf there were characters that we were *certain* would never appear in any unicode string, we could do it all under the covers by picking one of them ... but the safest thing to do would still be to have it as an option (but with a sensible default from the "private use" range instead of an empty string).  ... 

So the HtmlStripReader would have a constructor that looks like...

{code}
   /** 
   * @param entityFiller character to replace gaps made when entities are collapsed to real characters so that character positions still line up, may be null if no filler should be used 
   * @param tagFiller character to replace gaps made when entities are collapsed to real characters so that character positions still line up, may be null if no filler should be used 
   */
  public HtmlStripReader(Reader input, Character entityFiller, Character tagFiller) { ... }
{code}

and the Tokenizers could look like...

{code}
public class HTMLStripStandardTokenizerFactory extends BaseTokenizerFactory {
  Pattern fillerPattern;
  Character entityFiller, tagFiller; 
  public void init(...) {
    entityFiller = getInitParam(...);
    tagFiller = getInitParam(...)
    fillerPattern = getInitParam(stripFiller) ? makePattern(entityFiller, tagFiller) : null;
  }
  public TokenStream create(Reader input) {
    TokenStream s = new StandardTokenizer(new HTMLStripReader(input,entityFiller, tagFiller);
    If (null != fillerPatterm) {
      s = new PatternReplaceFiler(s, fillerPattern, "", true);
    }
    return s;
  }
}
{code}

that should totally work right?


> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Resolved: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yonik Seeley resolved SOLR-42.
------------------------------

    Resolution: Fixed

committed.  Thanks Grant!

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Attachment: SOLR-42.patch

OK, here's a patch for the entity problem, but yes, I do agree Yonik, a better approach is needed.

It might be as simple as the SolrHighlighter being aware the the HTMLStripReader is being used.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Assigned: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned SOLR-42:
-----------------------------------

    Assignee: Grant Ingersoll

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Reopened: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hoss Man reopened SOLR-42:
--------------------------


> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Attachment: SOLR-42.patch

Here's a new version, with some more testing and the ability to preserve certain tags via a passed in set.  This is useful for text where some tags are meaningful or at least are useful further down the chain.  I am not sure why the html test file is not included.  svn stat shows:
{quote}
A  +   src/test/test-files/htmlStripReaderTest.html
A      src/test/org/apache/solr/analysis/HTMLStripReaderTest.java
M      src/java/org/apache/solr/analysis/HTMLStripReader.java
{quote}



> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Mert Sakarya (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717688#action_12717688 ] 

Mert Sakarya commented on SOLR-42:
----------------------------------

I think this is a problem of Microsoft Word. No one can say that;

      ...valid html...<?xml:namespace prefix = o />...valid html...

is a valid HTML. Any HTMLParser should look for a "?>" after a "<?"

BUT! As a solution, I modified line 644 at HTMLStripReader.java as;

      //if (ch=='?' && peek()=='>') {
      if ((ch=='?' || ch=='/') && peek()=='>') { //This fixes Office Word problem, but might cause other problems!!! Be very careful.

And created my own HTMLStripReader in another jar file.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556158#action_12556158 ] 

Yonik Seeley commented on SOLR-42:
----------------------------------

Hmmm, I'm still getting the exception, even after adding the file.
Did you test with "ant clean test", or with a GUI?

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Work started: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on SOLR-42 started by Grant Ingersoll.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556098#action_12556098 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

Hmm, I don't know if this completely solves the problem even though the code seems like it is right.  I am still having trouble w/ the offsets.

has anyone else tried the patch?

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556740#action_12556740 ] 

Yonik Seeley commented on SOLR-42:
----------------------------------

{quote}i just have to wonder if there is a special marker character that could be used instead of whitespace{quote}
For a hack, not a bad idea...
There could be a TokenFilter that removes any such characters in tokens, and it could even be automatically used by Tokenizers that use the html strip reader.


> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628475#action_12628475 ] 

Lance Norskog commented on SOLR-42:
-----------------------------------

Another bigger nit: the last time I studied this, the HTML stripper code did not pull alt-text fields as searchable tags.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "J.J. Larrea (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532570 ] 

skeptikos edited comment on SOLR-42 at 10/4/07 9:45 PM:
----------------------------------------------------------

Here is the workaround I am using, along with a long comment explaining why:

{code:title=solrconfig.xml}
		<!--
			Special-case stuff for HTML tags in Abstract field:
			Originally we had
				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
			but pre-stripping destroys offsets needed for highiighting.
			Tried an HTML tag-extraction RegEx as a post-process
				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
				<filter class="solr.PatternReplaceFilterFactory"
					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
					replacement=""
					replace="all"/>
			but it still doesn't adjust the offset and the subsequent WDF then created havoc.
			One solution is to split on whitespace or tag delimiters (making tags
			into text), and either index the tags or use StopFilter to remove 'em.
			But the chosen solution is to swallow an entire chain of tags and any whitespace
			which surrounds or separates them, leaving non-HTML < and > intact, or else runs
			of whitespace as normal.

		-->
			<tokenizer class="solr.PatternTokenizerFactory"
					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
			<filter class="solr.ISOLatin1AccentFilterFactory"/>
			...
{code}

without the XMLEncoding the RegEx is:

			{{(?:\s*</?\w+((\s+\w+(\s*=\s*(?:"*?&"'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+}}

and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.

      was (Author: skeptikos):
    Here is the workaround I am using, along with a long comment explaining why:

		<!--
			Special-case stuff for HTML tags in Abstract field:
			Originally we had
				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
			but pre-stripping destroys offsets needed for highiighting.
			Tried an HTML tag-extraction RegEx as a post-process
				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
				<filter class="solr.PatternReplaceFilterFactory"
					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
					replacement=""
					replace="all"/>
			but it still doesn't adjust the offset and the subsequent WDF then created havoc.
			One solution is to split on whitespace or tag delimiters (making tags
			into text), and either index the tags or use StopFilter to remove 'em.
			But the chosen solution is to swallow an entire chain of tags and any whitespace
			which surrounds or separates them, leaving non-HTML < and > intact, or else runs
			of whitespace as normal.

		-->
			<tokenizer class="solr.PatternTokenizerFactory"
					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
			<filter class="solr.ISOLatin1AccentFilterFactory"/>
			...

without the XMLEncoding the RegEx is:

			(?:\s*</?\w+((\s+\w+(\s*=\s*(?:"*?&"'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+

and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.
  
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Attachment: SOLR-42.patch

OK, I'm 99.99% confident this fixes the issues w/ the while (true) loops and handles entities properly, etc.  and highlighting works correctly.  Added more tests, etc.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556634#action_12556634 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

Hmmm, still seems to be a problem with entities.  I think we need to replace the entity w/ the appropriate amount of whitespace.  I will work up test and patch.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556928#action_12556928 ] 

gsingers edited comment on SOLR-42 at 1/8/08 8:21 AM:
-------------------------------------------------------------

Yet another problem: 
In certain circumstances, it is possible that restoreState() cannot be invoked b/c the mark has been lost due to moving well beyond it.  This is most noticeable in the while (true) loop inside of readProcessingInstruction() and can be caused by the following test:
{code}
public void testQuestionMark() throws Exception {
    StringBuilder testBuilder = new StringBuilder(5020);
    testBuilder.append("ah<?> ");
    for (int i = 0; i < 5000; i++){
      testBuilder.append('a');//tack on enough to go beyond the mark readahead limit, since <?> makes HTMLStripReader think it is a processing instruction
    }
    String test = testBuilder.toString();
    Reader reader = new HTMLStripReader(new BufferedReader(new StringReader(test)));//force the use of BufferedReader
    int ch = 0;
    StringBuilder builder = new StringBuilder();
    try {
      while ((ch = reader.read()) != -1){
        builder.append((char)ch);
      }
    } finally {
      System.out.println("String: " + builder.toString());
    }
    assertTrue(builder.toString() + " is not equal to " + test, builder.toString().equals(test) == true);
  }
{code}

In this case, the final assert never gets hit, because there is an IOException in reader.read of:
{noformat}
java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(BufferedReader.java:485)
	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:158)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:731)
	at org.apache.solr.analysis.HTMLStripReaderTest.testQuestionMark(HTMLStripReaderTest.java:171)
{noformat}

I will add that this is based off of text seen in the wild.

      was (Author: gsingers):
    Yet another problem: 
In certain circumstances, it is possible that restoreState() cannot be invoked b/c the mark has been lost due to moving well beyond it.  This is most noticeable in the while (true) loop inside of readProcessingInstruction() and can be caused by the following test:
{code}
public void testQuestionMark() throws Exception {
    StringBuilder testBuilder = new StringBuilder(5020);
    testBuilder.append("ah<?> ");
    for (int i = 0; i < 5000; i++){
      testBuilder.append('a');//tack on enough to go beyond the mark readahead limit, since <?> makes HTMLStripReader think it is a processing instruction
    }
    String test = testBuilder.toString();
    Reader reader = new HTMLStripReader(new BufferedReader(new StringReader(test)));//force the use of BufferedReader
    int ch = 0;
    StringBuilder builder = new StringBuilder();
    try {
      while ((ch = reader.read()) != -1){
        builder.append((char)ch);
      }
    } finally {
      System.out.println("String: " + builder.toString());
    }
    assertTrue(builder.toString() + " is not equal to " + test, builder.toString().equals(test) == true);
  }
{code}

In this case, the final assert never gets hit, because there is an IOException in reader.read of:
{noformat}
java.io.IOException: Mark invalid
	at java.io.BufferedReader.reset(BufferedReader.java:485)
	at org.apache.solr.analysis.HTMLStripReader.restoreState(HTMLStripReader.java:158)
	at org.apache.solr.analysis.HTMLStripReader.read(HTMLStripReader.java:731)
	at org.apache.solr.analysis.HTMLStripReaderTest.testQuestionMark(HTMLStripReaderTest.java:171)
{noformat}
  
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-42:
-----------------------------

    Attachment: HtmlStripReaderTestXmlProcessing.patch

Updating test case to reflect the fact that offset info still gets screwed up in the XML processing instruction case even if there are no XML elements in the source XML

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "J.J. Larrea (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532570 ] 

J.J. Larrea commented on SOLR-42:
---------------------------------

Here is the workaround I am using, along with a long comment explaining why:

		<!--
			Special-case stuff for HTML tags in Abstract field:
			Originally we had
				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
			but pre-stripping destroys offsets needed for highiighting.
			Tried an HTML tag-extraction RegEx as a post-process
				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
				<filter class="solr.PatternReplaceFilterFactory"
					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
					replacement=""
					replace="all"/>
			but it still doesn't adjust the offset and the subsequent WDF then created havoc.
			One solution is to split on whitespace or tag delimiters (making tags
			into text), and either index the tags or use StopFilter to remove 'em.
			But the chosen solution is to swallow an entire chain of tags and any whitespace
			which surrounds or separates them, leaving non-HTML < and > intact, or else runs
			of whitespace as normal.

		-->
			<tokenizer class="solr.PatternTokenizerFactory"
					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
			<filter class="solr.ISOLatin1AccentFilterFactory"/>
			...

without the XMLEncoding the RegEx is:

			(?:\s*</?\w+((\s+\w+(\s*=\s*(?:"*?&"'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+

and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555601#action_12555601 ] 

gsingers edited comment on SOLR-42 at 1/3/08 8:21 AM:
-------------------------------------------------------------

Patch that replaces HTML cruft with whitespace, thus preserving highlighting at the expense of extra characters.  Also included the a test file and test case

NOTE: This is not backward-compatible with old HTMLStripReader, but seeing how it requires re-indexing to use anyway, I don't see that it is a concern.

      was (Author: gsingers):
    Patch that replaces HTML cruft with whitespace, thus preserving highlighting at the expense of extra characters.  Also included the a test file and test case
  
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556668#action_12556668 ] 

Mike Klaas commented on SOLR-42:
--------------------------------

> It might be as simple as the SolrHighlighter being aware the the HTMLStripReader is being used.

Grant, thanks for looking in to this issue.  What do you imagine the highlighter being able to do with that knowledge?  

Note that the entity problem is an issue for search, not just highlighting.  The proper tokens will not be created in the case of international chars

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Francisco Sanmartín (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628650#action_12628650 ] 

Francisco Sanmartín commented on SOLR-42:
-----------------------------------------

The patch is for solr 1.3? 

I tried to apply the patch to solr 1.2 but it fails :(

Can anybody complete the "affected version" and "fix version" field? thanks!

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mike Klaas updated SOLR-42:
---------------------------

    Component/s:     (was: update)
                 highlighter

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Mike Klaas (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556676#action_12556676 ] 

Mike Klaas commented on SOLR-42:
--------------------------------

> Of course, the real answer may be as suggested earlier and to apply the stripreader before sending to Solr.

HTMLStripTokenizer currently breaks the tokenizer contract, so it seems like the real answer is to fix the offsets.  I've glanced at the code, and it would be a significant amount of work to make the current implementation adhere to this contract.  The main problem is that no-one is really interested in doing this work.

> > What do you imagine the highlighter being able to do with that knowledge?

> My understanding of looking at the code is the disjoint comes from line 298. In the call to Lucene's highlighter, we pass in the TokenStream, which has > been stripped (or will be stripped if the the HTMLStripReader is employed) and the value from the stored field (docTexts[0]). If, docTexts[0] was stripped first, > then I think the offsets would be the same, no? Of course, it would be really easy to test.

You know, this is incredibly hacky, but I think that it is a great idea.

-Mike



> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462338 ] 

Hoss Man commented on SOLR-42:
------------------------------

Suggestion from Mirko on solr-dev: change HTMLStripReader to replace striped HTML with equal length whitespace.

(this could possibly be made a constructor option)

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "solrize (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620545#action_12620545 ] 

solrize commented on SOLR-42:
-----------------------------

I am getting a ton of these errors in real documents with the 2008-07-30 nightly build.  Any advice?  Thanks.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556729#action_12556729 ] 

Hoss Man commented on SOLR-42:
------------------------------

I don't know much about unicode, but there are *so* many special characters in unicode, i just have to wonder if there is a special marker character that could be used instead of whitespace to "fill in the gaps" left when converting entities to real characters (or stripping tags).  ... something that isn't printable, and does't trigger any "boundary" logic (ie: note whitespace, punctuation, letter, digit, etc...)

 * NUL perhaps?  (can you legally embed null in a string in java?)
 * does anyone understand the definition of a "nonspacing mark" ?
 * the "Invisible Separator" character?
 * a "private use" character?  (this actually seems like the most promising option)

I say we just punt: have two options that allows users to specify characters: one for when tags are striped, one for when entities are converted to normal characters ... default both to an empty string (ie: current behavior)

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556636#action_12556636 ] 

Yonik Seeley commented on SOLR-42:
----------------------------------

Hmmm, this points out a deficiency in this approach... it could break up words or tokens (with whitespace) that were not originally separated (think international char in the middle of a word).
So I think this approach is probably OK for now, but a better approach would have the tokenizer get the offsets from the reader somehow (perhaps just a whitespace tokenizer with HTML stripping integrated).

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Hoss Man (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480555 ] 

Hoss Man commented on SOLR-42:
------------------------------

the most helpful contribution would be a patch containing a JUnit test demonstrating hte problem and the necessary fix :)

http://wiki.apache.org/solr/HowToContribute

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556946#action_12556946 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

OK, I think the solution for these pieces is to change the 
while (true) loops inside of HTMLStripReader to only read up to READAHEAD chars so that it can always fall back to the last mark


I will work up a patch.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "solrize (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641883#action_12641883 ] 

solrize commented on SOLR-42:
-----------------------------

This problem is still present in solr 1.3.0 and I'm getting a ton of those "mark invalid" exceptions.  Is this really a minor issue?!!!!

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Lance Norskog (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12628473#action_12628473 ] 

Lance Norskog commented on SOLR-42:
-----------------------------------

A slight nit: Unicode contains three characters, fffD, fffE, and fffF, which are not supposed to appear in any document. One (fffD i think) is the offical "not a character" character, and should be used for the purpose of separating terms.



> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556160#action_12556160 ] 

Yonik Seeley commented on SOLR-42:
----------------------------------

To elaborate, I think the CWD when running tests under ant is "src\test\test-files", so adding that to the filename when you try to open the file seems incorrect.  To match ant, I have intellij set up to use src\test\test-files as the CWD when running tests.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Anders Melchiorsen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12749421#action_12749421 ] 

Anders Melchiorsen commented on SOLR-42:
----------------------------------------

I found the issue with entity replacement breaking up words, and created SOLR-1394.

My patch in that issue is to HTMLStripCharFilter.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Paul Fryer (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480554 ] 

Paul Fryer commented on SOLR-42:
--------------------------------

Any update on this? I was hoping to use Solr to do full text search of HTML documents and provide snippets with search terms highlighted.  I assume I won't be able to provide accurate HTML snippet highlighting until this issue is resolved - correct? If so, what sort of time frame would be be looking at? Anything I can contribute to help?

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Attachment: HTMLStripReaderTest.java

Here's the start of a test for the HTMLStripReader.  Note, in order to incorporate my suggestion of replacing HTML tags w/ whitespace, this test would need to be modified.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>         Attachments: HTMLStripReaderTest.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Attachment: htmlStripReaderTest.html

Here is the file.  It shows it as being added in svn stat, but not sure why it wasn't included.

It goes in src/test/test-files

It's just a copy of site/index.html

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Jim Murphy (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12625835#action_12625835 ] 

Jim Murphy commented on SOLR-42:
--------------------------------

I've tracked down some background info on this issue - at least the way it was affecting me.  I could care less about highlighting - I'm using the HTMLStripWhitespaceTokenizerFactory during indexing to tokenize blog content - which obviously contains lots of html.  

The pathological case I've found with our input document set is:

Content contains a malformed xml processing instruction in the first "page" of the buffer that contains more than one page of data.

It seems this is a fairly common (maybe MS Word XML?) form of invalid HTML.  Commonly it looks like this:

...valid html...<?xml:namespace prefix = o />...valid html...

Notice the PI starts with "<?xml" but terminates with a close tag., doh.

This issue is manifested in HTMLStripReader. It causes the following code to read too much off the buffer and invalidates the previous mark at the beginning of the tag.

private int readProcessingInstruction() throws IOException {
    // "<?" has already been read
    while ((numRead - lastMark) < readAheadLimitMinus1) {
      int ch = next();
      if (ch=='?' && peek()=='>') {
        next();
        return MATCH;
      } else if (ch==-1) {
        return MISMATCH;
      }

    }
    return MISMATCH;
  }

The demoralizing part is the special treatment (readAheadLimitMinus1) isn't enough.  There is actually a "over read" by 2 chars.

The IOException - Invalid Mark happens when readProcessingInstruction() retuns (a mismatch because the entire buffer is read without finding the close PI) and restoreState(); is called to reset the marks - which fails.

If I tweak readAheadLimitMinus1 like this

readAheadLimitMinus1 -= 2 

So maybe the variable should be readAheadLimitMinus3 ;)

then the buffer limits are preserved and the exception isn't found, parsing proceedes as expected.

Jim  



> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Yonik Seeley (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556104#action_12556104 ] 

Yonik Seeley commented on SOLR-42:
----------------------------------

Grant, I'm getting a test failure... did you forget to "svn add" some files?

    <error message="src\test\test-files\htmlStripReaderTest.html (The system ca
not find the path specified)" type="java.io.FileNotFoundException">java.io.File
otFoundException: src\test\test-files\htmlStripReaderTest.html (The system cann
t find the path specified)
        at java.io.FileInputStream.open(Native Method)
        at java.io.FileInputStream.&lt;init&gt;(FileInputStream.java:106)
        at java.io.FileReader.&lt;init&gt;(FileReader.java:55)
        at org.apache.solr.analysis.HTMLStripReaderTest.testHTML(HTMLStripReade
Test.java:65)

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556670#action_12556670 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

{quote}
What do you imagine the highlighter being able to do with that knowledge?
{quote}

My understanding of looking at the code is the disjoint comes from line 298.  In the call to Lucene's highlighter, we pass in the TokenStream, which has been stripped (or will be stripped if the the HTMLStripReader is employed) and the value from the stored field (docTexts[0]).  If, docTexts[0] was stripped first, then I _think_ the offsets would be the same, no?  Of course, it would be really easy to test.  

Of course, the real answer may be as suggested earlier and to apply the stripreader before sending to Solr.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556149#action_12556149 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

{quote}
There is definitely still an issue w/ the HTMLStripReader in some circumstances, working on a test now.
{quote}

Red herring.  The problem I was seeing is due to some tags that are valid syntax for my text are being stripped.  I think this patch can go ahead, although it might be useful to keep certain tags (i.e. a set of tags)

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Erik Hatcher (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557395#action_12557395 ] 

Erik Hatcher commented on SOLR-42:
----------------------------------

patch applied, all tests still pass, and committed.  Thanks Grant!

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556161#action_12556161 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

You are right.  I forgot that the CWD is different for Solr.  Feel  
free to strip off that initial path stuff, or I can generate a new  
patch.

-Grant



--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com
http://www.lucenebootcamp.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12556113#action_12556113 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

There is definitely still an issue w/ the HTMLStripReader in some circumstances, working on a test now.

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java, SOLR-42.patch
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12555559#action_12555559 ] 

Grant Ingersoll commented on SOLR-42:
-------------------------------------

Might a solution to this be to replace the removed characters with whitespace, thus preserving the input offsets?

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Issue Comment Edited: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "J.J. Larrea (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532570 ] 

skeptikos edited comment on SOLR-42 at 10/4/07 10:03 PM:
-----------------------------------------------------------

Here is the workaround I am using, along with a long comment explaining why:

{code:title=solrconfig.xml}
		<!--
			Special-case stuff for HTML tags in Abstract field:
			Originally we had
				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
			but pre-stripping destroys offsets needed for highiighting.
			Tried an HTML tag-extraction RegEx as a post-process
				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
				<filter class="solr.PatternReplaceFilterFactory"
					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
					replacement=""
					replace="all"/>
			but it still doesn't adjust the offset and the subsequent WDF then created havoc.
			One solution is to split on whitespace or tag delimiters (making tags
			into text), and either index the tags or use StopFilter to remove 'em.
			But the chosen solution is to swallow an entire chain of tags and any whitespace
			which surrounds or separates them, leaving non-HTML < and > intact, or else runs
			of whitespace as normal.

		-->
			<tokenizer class="solr.PatternTokenizerFactory"
					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
			<filter class="solr.ISOLatin1AccentFilterFactory"/>
			...
{code}

without the XMLEncoding the RegEx is:

			{{(?:\s*</?\w+((\s+\w+(\s*=\s*(?:"*?&"'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+}}

and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.

Of course this is just a hack: It doesn't have any real understanding of HTML or XML syntax, so something invalid like </closing attr="x"/> will get matched.  On the other hand, < and > in text will be left alone.  

Also note it doesn't decode XML or HTML numeric or symbolic entity references, as HTMLStripReader does (my indexer is pre-decoding the entity references before sending the text to Solr for indexing).

So fixing HTMLStripReader and its dependent HTMLStripXXXTokenizers to do the right thing with offsets would still be a worthy task.  I wonder whether recasting HTMLStripReader using the org.apache.lucene.analysis.standard.CharStream interface would make sense for this?

      was (Author: skeptikos):
    Here is the workaround I am using, along with a long comment explaining why:

{code:title=solrconfig.xml}
		<!--
			Special-case stuff for HTML tags in Abstract field:
			Originally we had
				<tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
			but pre-stripping destroys offsets needed for highiighting.
			Tried an HTML tag-extraction RegEx as a post-process
				<tokenizer class="solr.WhitespaceTokenizerFactory"/>
				<filter class="solr.PatternReplaceFilterFactory"
					pattern="&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;"
					replacement=""
					replace="all"/>
			but it still doesn't adjust the offset and the subsequent WDF then created havoc.
			One solution is to split on whitespace or tag delimiters (making tags
			into text), and either index the tags or use StopFilter to remove 'em.
			But the chosen solution is to swallow an entire chain of tags and any whitespace
			which surrounds or separates them, leaving non-HTML < and > intact, or else runs
			of whitespace as normal.

		-->
			<tokenizer class="solr.PatternTokenizerFactory"
					pattern="(?:\s*&lt;/?\w+((\s+\w+(\s*=\s*(?:&quot;.*?&quot;|'.*?'|[^'&quot;>\s]+))?)+\s*|\s*)/?&gt;\s*)++|\s+"/>
			<filter class="solr.ISOLatin1AccentFilterFactory"/>
			...
{code}

without the XMLEncoding the RegEx is:

			{{(?:\s*</?\w+((\s+\w+(\s*=\s*(?:"*?&"'.*?'|[^'">\s]+))?)+\s*|\s*)/?>\s*)++|\s+}}

and it will treat runs of "things that look like HTML/XML open or close tags with optional attributes, optionally preceded or followed by spaces" identically to "runs of one or more spaces" as token delimiters, and swallow them up, so the previous and following tokens have the correct offsets.
  
> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>            Reporter: Andrew May
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated SOLR-42:
--------------------------------

    Priority: Minor  (was: Major)

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Priority: Minor
>         Attachments: HTMLStripReaderTest.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (SOLR-42) Highlighting problems with HTMLStripWhitespaceTokenizerFactory

Posted by "Chris Harris (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/SOLR-42?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Harris updated SOLR-42:
-----------------------------

    Attachment: TokenPrinter.java
                HtmlStripReaderTestXmlProcessing.patch

The committed HtmlStripReader doesn't seem to handle offsets correctly for XML processing instructions such as this:

    <?xml version="1.0" encoding="UTF-8" ?>

I'm attaching two files:

HtmlStripReaderTestXmlProcessing.patch adds an HtmlStripReader test case to catch the problem. (The test currently fails.)

TokenPrinter.java can help make it a little clearer what the problem actually is. Here is the output if I run it against against the analysis code in trunk. Note that the offsets are basically what one would expect, except in the XML processing instructions case, where the start position is off by one:

-------------------------------------
String to test: <uniqueKey>id</uniqueKey>
 Token info:
   token 'id'
     startOffset: 11
     char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- Unless this field is marked with required="false", it will be a required field -->
<uniqueKey>id</uniqueKey>
 Token info:
   token 'id'
     startOffset: 99
     char at startOffset, and next few: 'id</u'
-------------------------------------
String to test: <!-- And now: two elements --> <element1>one</element1>
 <element2>two</element2>
 Token info:
   token 'one'
     startOffset: 41
     char at startOffset, and next few: 'one</'
   token 'two'
     startOffset: 68
     char at startOffset, and next few: 'two</'
-------------------------------------
String to test: <?xml version="1.0" encoding="UTF-8" ?><uniqueKey>id</uniqueKey>
 Token info:
   token 'id'
     startOffset: 49
     char at startOffset, and next few: '>id</'
-------------------------------------

> Highlighting problems with HTMLStripWhitespaceTokenizerFactory
> --------------------------------------------------------------
>
>                 Key: SOLR-42
>                 URL: https://issues.apache.org/jira/browse/SOLR-42
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>            Reporter: Andrew May
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: htmlStripReaderTest.html, HTMLStripReaderTest.java, HtmlStripReaderTestXmlProcessing.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, SOLR-42.patch, TokenPrinter.java
>
>
> Indexing content that contains HTML markup, causes problems with highlighting if the HTMLStripWhitespaceTokenizerFactory is used (to prevent the tag names from being searchable).
> Example title field:
> <SUP>40</SUP>Ar/<SUP>39</SUP>Ar laserprobe dating of mylonitic fabrics in a polyorogenic terrane of NW Iberia
> Searching for title:fabrics with highlighting on, the highlighted version has the <em> tags in the wrong place - 22 characters to the left of where they should be (i.e. the sum of the lengths of the tags).
> Response from Yonik on the solr-user mailing-list:
> HTMLStripWhitespaceTokenizerFactory works in two phases...
> HTMLStripReader removes the HTML and passes the result to
> WhitespaceTokenizer... at that point, Tokens are generated, but the
> offsets will correspond to the text after HTML removal, not before.
> I did it this way so that HTMLStripReader  could go before any
> tokenizer (like StandardTokenizer).
> Can you open a JIRA bug for this?  The fix would be a special version
> of HTMLStripReader integrated with a WhitespaceTokenizer to keep
> offsets correct. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.