Posted to solr-user@lucene.apache.org by Anders Melchiorsen <ma...@cup.kalibalik.dk> on 2009/08/26 13:13:20 UTC

HTML decoder is splitting tokens

Hi.

When indexing the string "G&uuml;nther" with
HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two tokens,
"Gü" and "nther".

Is this a bug, or am I doing something wrong?

(Using a Solr nightly from 2009-05-29)


Anders.



Re: HTML decoder is splitting tokens

Posted by Anders Melchiorsen <ma...@spoon.kalibalik.dk>.
Koji Sekiguchi <ko...@r.email.ne.jp> writes:

> That is what happens when you have these mapping definitions:
>
> "&lt;" => "<"
> "&gt;" => ">"
>    :              :
>
> But I assumed you would not include those, and would have only:
>
> "&uuml;" => "ü"
> "&auml;" => "ä"
>    :             :
>
> Wouldn't that solve your problem?

Hi Koji,

Oh, it seems I missed part of your suggestion. So you propose having
mappings for all entities except the troublesome lt, gt and amp?

That should work, as long as it is acceptable that whitespace follows
those characters. I expect that will be fine in most situations.
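
(For reference, a sketch of what such a mapping file could contain; the
entries and the comment are only illustrative:)

# no entries for "&lt;", "&gt;" or "&amp;": those are left for the
# HTML stripper to decode, so escaped markup in the text is not removed
"&uuml;" => "ü"
"&auml;" => "ä"
"&ouml;" => "ö"
"&eacute;" => "é"
"&nbsp;" => " "
    :             :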

Still, while that is a clever workaround, it does not change the fact
that the advertised functionality of the HTML stripper is broken.


I have now signed up for JIRA and created SOLR-1394 for this issue.


Thanks,
Anders.

Re: HTML decoder is splitting tokens

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Anders,

Thank you for the explanation.

 > which could be written in HTML like this:
 >
 > use <tt>&lt;p&gt;</tt> to mark a paragraph

Ok.

 > so the mapping char filter would map it into:
 >
 > use <tt><p></tt> to mark a paragraph

That is what happens when you have these mapping definitions:

"&lt;" => "<"
"&gt;" => ">"
    :              :

But I assumed you would not include those, and would have only:

"&uuml;" => "ü"
"&auml;" => "ä"
    :             :

Wouldn't that solve your problem?

Thank you,

Koji

Anders Melchiorsen wrote:
> Koji Sekiguchi <ko...@r.email.ne.jp> writes:
>
>   
>> Thank you for attaching the patch. Sorry again, I don't have enough
>> time to investigate the patch and the problem you are having, but I'd
>> just like to recommend that you open a JIRA issue and attach the
>> patch so that I or someone else can look into it later.
>>     
>
> Sorry, learning an issue tracker every time I find a bug in some
> project is too much trouble. I wouldn't mind if someone else transfers
> my previous mail, though.
>
>
>   
>> And I didn't understand this part of your previous mail:
>>
>>     
>>> Adding MappingCharFilterFactory in front of the HTML stripper (so
>>> that the latter will not see the entity) does work as expected.
>>> That is, until I try strings like "use &lt;p&gt; to mark a
>>> paragraph", where the HTML stripper will then remove parts of the
>>> actual text. So this approach will not work.
>>>       
>
> Entity mapping and tag removal have to happen in one pass to keep
> fidelity.
>
> Let's say that we are analyzing a tutorial on writing HTML. It might
> contain the text:
>
>     use <p> to mark a paragraph
>
> which could be written in HTML like this:
>
>     use <tt>&lt;p&gt;</tt> to mark a paragraph
>
> so the mapping char filter would map it into:
>
>     use <tt><p></tt> to mark a paragraph
>
> which is already wrong. Next, the HTML stripper would remove the tags:
>
>     use to mark a paragraph
>
> and we have now lost a part of the original text.
>
>
> Cheers,
> Anders.
>
>   


Re: HTML decoder is splitting tokens

Posted by Anders Melchiorsen <ma...@spoon.kalibalik.dk>.
Koji Sekiguchi <ko...@r.email.ne.jp> writes:

> Thank you for attaching the patch. Sorry again, I don't have enough
> time to investigate the patch and the problem you are having, but I'd
> just like to recommend that you open a JIRA issue and attach the
> patch so that I or someone else can look into it later.

Sorry, learning an issue tracker every time I find a bug in some
project is too much trouble. I wouldn't mind if someone else transfers
my previous mail, though.


> And I didn't understand this part of your previous mail:
>
>> Adding MappingCharFilterFactory in front of the HTML stripper (so
>> that the latter will not see the entity) does work as expected.
>> That is, until I try strings like "use &lt;p&gt; to mark a
>> paragraph", where the HTML stripper will then remove parts of the
>> actual text. So this approach will not work.

Entity mapping and tag removal have to happen in one pass to keep
fidelity.

Let's say that we are analyzing a tutorial on writing HTML. It might
contain the text:

    use <p> to mark a paragraph

which could be written in HTML like this:

    use <tt>&lt;p&gt;</tt> to mark a paragraph

so the mapping char filter would map it into:

    use <tt><p></tt> to mark a paragraph

which is already wrong. Next, the HTML stripper would remove the tags:

    use to mark a paragraph

and we have now lost a part of the original text.


Cheers,
Anders.

Re: HTML decoder is splitting tokens

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Anders,

Thank you for attaching the patch. Sorry again, I don't have
enough time to investigate the patch and the problem you are having,
but I'd just like to recommend that you open a JIRA issue
and attach the patch so that I or someone else can look into it later.

And I didn't understand this part of your previous mail:

 > Adding MappingCharFilterFactory in front of the HTML stripper (so
 > that the latter will not see the entity) does work as expected. That
 > is, until I try strings like "use &lt;p&gt; to mark a paragraph",
 > where the HTML stripper will then remove parts of the actual text.
 > So this approach will not work.

Thanks,

Koji

Anders Melchiorsen wrote:
> Greetings.
>
> I am moving this issue from the solr-user list. As can be seen in the
> messages below, I am having problems with the Solr HTML stripper.
>
> After some investigation, I have found the cause to be that the
> stripper is replacing the removed HTML with spaces. This obviously
> breaks when the HTML is in the middle of a word, like "G&uuml;nther".
>
> So, without knowing what I was doing, I hacked together a fix that
> uses offset correction instead.
>
> That seemed to work, except that closing tags and attributes still
> broke the positioning. With even less of a clue, I replaced read()
> with next() in the two methods handling those.
>
> Finally, invalid HTML also gave wrong offsets, and I fixed that by
> restoring numRead when rolling back the input stream.
>
> At this point I stopped trying to break it, so there may still be more
> problems. Or I might have introduced some problem on my own. Anyway, I
> have put the three patches at the bottom of this mail, in case
> somebody wants to move this issue along.
>
>
>
> Regards,
> Anders.
>
>
>
> "Anders Melchiorsen" <ma...@spoon.kalibalik.dk> writes:
>
>   
>> Hello.
>>
>> Thanks for the hints. Still some trouble, though.
>>
>> I added just the HTMLStripCharFilterFactory because, according to
>> the documentation, it should also replace HTML entities. It did, but
>> still left a space after the entity, so I got two tokens from
>> "G&uuml;nther". That seems like a bug?
>>
>> Adding MappingCharFilterFactory in front of the HTML stripper (so
>> that the latter will not see the entity) does work as expected. That
>> is, until I try strings like "use &lt;p&gt; to mark a paragraph",
>> where the HTML stripper will then remove parts of the actual text.
>> So this approach will not work.
>>
>>
>> Finally, I was happy that I could now use an arbitrary tokenizer
>> with HTML input. The PatternTokenizer, however, seems to be using
>> character offsets corresponding to the output of the char filters,
>> and so the highlighting markers end up at the wrong place. Is that a
>> bug, or a configuration issue?
>>
>>
>> Cheers,
>> Anders.
>>
>>
>> Koji Sekiguchi wrote:
>>     
>>> Hi Anders,
>>>
>>> Sorry, I don't know whether this is a bug or a feature, but
>>> I'd like to show an alternative approach if you'd like.
>>>
>>> In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
>>> marked as deprecated. Instead, you are encouraged to use
>>> HTMLStripCharFilterFactory with an arbitrary TokenizerFactory.
>>> And I'd recommend using MappingCharFilterFactory
>>> to convert character references to real characters.
>>> That is, you would have:
>>>
>>> <fieldType name="textHtml" class="solr.TextField" >
>>>   <analyzer>
>>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> where mapping.txt contains:
>>>
>>> "&uuml;" => "ü"
>>> "&auml;" => "ä"
>>> "&iuml;" => "ï"
>>> "&euml;" => "ë"
>>> "&ouml;" => "ö"
>>>     :             :
>>>
>>> Then run analysis.jsp and see the result.
>>>
>>> Thank you,
>>>
>>> Koji
>>>
>>>
>>> Anders Melchiorsen wrote:
>>>       
>>>> Hi.
>>>>
>>>> When indexing the string "G&uuml;nther" with
>>>> HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two
>>>> tokens, "Gü" and "nther".
>>>>
>>>> Is this a bug, or am I doing something wrong?
>>>>
>>>> (Using a Solr nightly from 2009-05-29)
>>>>
>>>>
>>>> Anders.
>>>>
>>>>         
>
>
> commit 1fb2d42181d8effb1b444aa2fa02d86df1d860d7
> Author: Anders Melchiorsen <ma...@spoon.kalibalik.dk>
> Date:   Fri Aug 28 15:57:03 2009 +0200
>
>     Use offset correction instead of inserting spaces into the stream
>     
>     Fixes "G&uuml;nther" turning into "Gü     nther".
>
> diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
> index 733d783..e473cef 100644
> --- a/HTMLStripCharFilter.java
> +++ b/HTMLStripCharFilter.java
> @@ -37,7 +37,9 @@ public class HTMLStripCharFilter extends BaseCharFilter {
>    private int readAheadLimit = DEFAULT_READ_AHEAD;
>    private int safeReadAheadLimit = readAheadLimit - 3;
>    private int numWhitespace = 0;
> +  private int numWhitespaceCorrected = 0;
>    private int numRead = 0;
> +  private int numReadLast = 0;
>    private int lastMark;
>    private Set<String> escapedTags;
>  
> @@ -674,9 +676,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
>      // where do we have to worry about them?
>      // <![ CDATA [ unescaped markup ]]>
>      if (numWhitespace > 0){
> -      numWhitespace--;
> -      return ' ';
> +      addOffCorrectMap(numReadLast+1-numWhitespaceCorrected, numWhitespaceCorrected+numWhitespace);
> +      numWhitespaceCorrected += numWhitespace;
> +      numWhitespace = 0;
>      }
> +    numReadLast = numRead;
>      //do not limit this one by the READAHEAD
>      while(true) {
>        int lastNumRead = numRead;
>
> commit 542f5734136bbfd72ae802c30b6c61361268bccf
> Author: Anders Melchiorsen <ma...@spoon.kalibalik.dk>
> Date:   Fri Aug 28 15:57:29 2009 +0200
>
>     Use next() in place of read()
>     
>     The read() method is our public interface, while next()
>     is what we use internally to get the next character.
>
> diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
> index e473cef..ab14de5 100644
> --- a/HTMLStripCharFilter.java
> +++ b/HTMLStripCharFilter.java
> @@ -537,13 +537,13 @@ public class HTMLStripCharFilter extends BaseCharFilter {
>  
>    private int readName(boolean checkEscaped) throws IOException {
>      StringBuilder builder = (checkEscaped && escapedTags!=null) ? new StringBuilder() : null;
> -    int ch = read();
> +    int ch = next();
>      if (builder!=null) builder.append((char)ch);
>      if (!isFirstIdChar(ch)) return MISMATCH;
> -    ch = read();
> +    ch = next();
>      if (builder!=null) builder.append((char)ch);
>      while(isIdChar(ch)) {
> -      ch=read();
> +      ch = next();
>        if (builder!=null) builder.append((char)ch);
>      }
>      if (ch!=-1) {
> @@ -572,11 +572,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
>      //  <a href="a/<!--#echo "path"-->">
>      private int readAttr2() throws IOException {
>      if ((numRead - lastMark < safeReadAheadLimit)) {
> -      int ch = read();
> +      int ch = next();
>        if (!isFirstIdChar(ch)) return MISMATCH;
> -      ch = read();
> +      ch = next();
>        while(isIdChar(ch) && ((numRead - lastMark) < safeReadAheadLimit)){
> -        ch=read();
> +        ch = next();
>        }
>        if (isSpace(ch)) ch = nextSkipWS();
>  
>
> commit fdaa0920e2dceeb33e534138fe4a672914aff0ea
> Author: Anders Melchiorsen <ma...@spoon.kalibalik.dk>
> Date:   Fri Aug 28 16:31:34 2009 +0200
>
>     Restore the numRead variable when rolling back the stream
>     
>     This fixes offset corrections after invalid HTML input, like
>     "hi &<< <b>there</b>".
>
> diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
> index ab14de5..4bfa85b 100644
> --- a/HTMLStripCharFilter.java
> +++ b/HTMLStripCharFilter.java
> @@ -171,6 +171,7 @@ public class HTMLStripCharFilter extends BaseCharFilter {
>    private void restoreState() throws IOException {
>      input.reset();
>      pushed.setLength(0);
> +    numRead = lastMark;
>    }
>  
>    private int readNumericEntity() throws IOException {
>
> commit 571537795af2edb54543db2f71550662b0a18e60
> Author: Anders Melchiorsen <am...@gnu.jobsafari.dk>
> Date:   Fri Aug 28 23:49:03 2009 +0200
>
>     Update some tests.
>
> diff --git a/HTMLStripCharFilterTest.java b/HTMLStripCharFilterTest.java
> index 7be7c7e..4730830 100644
> --- a/HTMLStripCharFilterTest.java
> +++ b/HTMLStripCharFilterTest.java
> @@ -49,9 +49,9 @@ public class HTMLStripCharFilterTest extends TestCase {
>      String html = "<div class=\"foo\">this is some text</div> here is a <a href=\"#bar\">link</a> and " +
>              "another <a href=\"http://lucene.apache.org/\">link</a>. " +
>              "This is an entity: &amp; plus a &lt;.  Here is an &. <!-- is a comment -->";
> -    String gold = "                 this is some text       here is a                link     and " +
> -            "another                                     link    . " +
> -            "This is an entity: &     plus a <   .  Here is an &.                      ";
> +    String gold = " this is some text  here is a  link  and " +
> +            "another  link . " +
> +            "This is an entity: & plus a <.  Here is an &. ";
>      HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new StringReader(html)));
>      StringBuilder builder = new StringBuilder();
>      int ch = -1;
> @@ -87,7 +87,7 @@ public class HTMLStripCharFilterTest extends TestCase {
>  
>    public void testGamma() throws Exception {
>      String test = "&Gamma;";
> -    String gold = "\u0393      ";
> +    String gold = "\u0393";
>      Set<String> set = new HashSet<String>();
>      set.add("reserved");
>      Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
> @@ -104,7 +104,7 @@ public class HTMLStripCharFilterTest extends TestCase {
>  
>    public void testEntities() throws Exception {
>      String test = "&nbsp; &lt;foo&gt; &#61; &Gamma; bar &#x393;";
> -    String gold = "       <   foo>    =     \u0393       bar \u0393     ";
> +    String gold = "  <foo> = \u0393 bar \u0393";
>      Set<String> set = new HashSet<String>();
>      set.add("reserved");
>      Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
> @@ -121,7 +121,7 @@ public class HTMLStripCharFilterTest extends TestCase {
>  
>    public void testMoreEntities() throws Exception {
>      String test = "&nbsp; &lt;junk/&gt; &nbsp; &#33; &#64; and &#8217;";
> -    String gold = "       <   junk/>           !     @     and ’      ";
> +    String gold = "  <junk/>   ! @ and ’";
>      Set<String> set = new HashSet<String>();
>      set.add("reserved");
>      Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
> @@ -264,7 +264,7 @@ public class HTMLStripCharFilterTest extends TestCase {
>    public void testComment() throws Exception {
>  
>      String test = "<!--- three dashes, still a valid comment ---> ";
> -    String gold = "                                               ";
> +    String gold = "  ";
>      Reader reader = new HTMLStripCharFilter(CharReader.get(new BufferedReader(new StringReader(test))));//force the use of BufferedReader
>      int ch = 0;
>      StringBuilder builder = new StringBuilder();
>
>   


Re: HTML decoder is splitting tokens

Posted by Anders Melchiorsen <ma...@spoon.kalibalik.dk>.
Greetings.

I am moving this issue from the solr-user list. As can be seen in the
messages below, I am having problems with the Solr HTML stripper.

After some investigation, I have found the cause to be that the
stripper is replacing the removed HTML with spaces. This obviously
breaks when the HTML is in the middle of a word, like "G&uuml;nther".

So, without knowing what I was doing, I hacked together a fix that
uses offset correction instead.

That seemed to work, except that closing tags and attributes still
broke the positioning. With even less of a clue, I replaced read()
with next() in the two methods handling those.

Finally, invalid HTML also gave wrong offsets, and I fixed that by
restoring numRead when rolling back the input stream.
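
(A minimal, self-contained sketch of the offset-correction idea follows.
It is toy code: the class and method names are made up, it is not the
actual HTMLStripCharFilter/BaseCharFilter API, and it only handles tags,
but entities would be treated the same way. The point is that instead of
padding removed markup with spaces, the filter records how many input
characters have been dropped, so offsets into the stripped text can be
mapped back to the original.)

import java.util.ArrayList;
import java.util.List;

// Toy illustration only; not the real Lucene/Solr API.
class OffsetCorrectingStripper {
    // Each entry: { offset in the stripped output, cumulative chars removed so far }.
    private final List<int[]> corrections = new ArrayList<int[]>();

    String strip(String html) {
        StringBuilder out = new StringBuilder();
        int removed = 0;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                // Drop the whole tag instead of emitting spaces for it.
                int close = html.indexOf('>', i);
                int end = (close == -1) ? html.length() - 1 : close;
                removed += end - i + 1;
                corrections.add(new int[] { out.length(), removed });
                i = end;
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Map an offset in the stripped output back to an offset in the original input.
    int correct(int strippedOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] <= strippedOffset) diff = c[1];
        }
        return strippedOffset + diff;
    }
}

With input such as "G<b>ü</b>nther", strip() returns the single word
"Günther", and correct() maps every position in that word back to the
original markup, which is what downstream consumers like highlighting need.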

At this point I stopped trying to break it, so there may still be more
problems. Or I might have introduced some problem on my own. Anyway, I
have put the three patches at the bottom of this mail, in case
somebody wants to move this issue along.



Regards,
Anders.



"Anders Melchiorsen" <ma...@spoon.kalibalik.dk> writes:

> Hello.
>
> Thanks for the hints. Still some trouble, though.
>
> I added just the HTMLStripCharFilterFactory because, according to
> the documentation, it should also replace HTML entities. It did, but
> still left a space after the entity, so I got two tokens from
> "G&uuml;nther". That seems like a bug?
>
> Adding MappingCharFilterFactory in front of the HTML stripper (so
> that the latter will not see the entity) does work as expected. That
> is, until I try strings like "use &lt;p&gt; to mark a paragraph",
> where the HTML stripper will then remove parts of the actual text.
> So this approach will not work.
>
>
> Finally, I was happy that I could now use an arbitrary tokenizer
> with HTML input. The PatternTokenizer, however, seems to be using
> character offsets corresponding to the output of the char filters,
> and so the highlighting markers end up at the wrong place. Is that a
> bug, or a configuration issue?
>
>
> Cheers,
> Anders.
>
>
> Koji Sekiguchi wrote:
>> Hi Anders,
>>
>> Sorry, I don't know whether this is a bug or a feature, but
>> I'd like to show an alternative approach if you'd like.
>>
>> In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
>> marked as deprecated. Instead, you are encouraged to use
>> HTMLStripCharFilterFactory with an arbitrary TokenizerFactory.
>> And I'd recommend using MappingCharFilterFactory
>> to convert character references to real characters.
>> That is, you would have:
>>
>> <fieldType name="textHtml" class="solr.TextField" >
>>   <analyzer>
>>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> where mapping.txt contains:
>>
>> "&uuml;" => "ü"
>> "&auml;" => "ä"
>> "&iuml;" => "ï"
>> "&euml;" => "ë"
>> "&ouml;" => "ö"
>>     :             :
>>
>> Then run analysis.jsp and see the result.
>>
>> Thank you,
>>
>> Koji
>>
>>
>> Anders Melchiorsen wrote:
>>> Hi.
>>>
>>> When indexing the string "G&uuml;nther" with
>>> HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two
>>> tokens, "Gü" and "nther".
>>>
>>> Is this a bug, or am I doing something wrong?
>>>
>>> (Using a Solr nightly from 2009-05-29)
>>>
>>>
>>> Anders.
>>>


commit 1fb2d42181d8effb1b444aa2fa02d86df1d860d7
Author: Anders Melchiorsen <ma...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:03 2009 +0200

    Use offset correction instead of inserting spaces into the stream
    
    Fixes "G&uuml;nther" turning into "Gü     nther".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index 733d783..e473cef 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -37,7 +37,9 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private int readAheadLimit = DEFAULT_READ_AHEAD;
   private int safeReadAheadLimit = readAheadLimit - 3;
   private int numWhitespace = 0;
+  private int numWhitespaceCorrected = 0;
   private int numRead = 0;
+  private int numReadLast = 0;
   private int lastMark;
   private Set<String> escapedTags;
 
@@ -674,9 +676,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
     // where do we have to worry about them?
     // <![ CDATA [ unescaped markup ]]>
     if (numWhitespace > 0){
-      numWhitespace--;
-      return ' ';
+      addOffCorrectMap(numReadLast+1-numWhitespaceCorrected, numWhitespaceCorrected+numWhitespace);
+      numWhitespaceCorrected += numWhitespace;
+      numWhitespace = 0;
     }
+    numReadLast = numRead;
     //do not limit this one by the READAHEAD
     while(true) {
       int lastNumRead = numRead;

commit 542f5734136bbfd72ae802c30b6c61361268bccf
Author: Anders Melchiorsen <ma...@spoon.kalibalik.dk>
Date:   Fri Aug 28 15:57:29 2009 +0200

    Use next() in place of read()
    
    The read() method is our public interface, while next()
    is what we use internally to get the next character.

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index e473cef..ab14de5 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -537,13 +537,13 @@ public class HTMLStripCharFilter extends BaseCharFilter {
 
   private int readName(boolean checkEscaped) throws IOException {
     StringBuilder builder = (checkEscaped && escapedTags!=null) ? new StringBuilder() : null;
-    int ch = read();
+    int ch = next();
     if (builder!=null) builder.append((char)ch);
     if (!isFirstIdChar(ch)) return MISMATCH;
-    ch = read();
+    ch = next();
     if (builder!=null) builder.append((char)ch);
     while(isIdChar(ch)) {
-      ch=read();
+      ch = next();
       if (builder!=null) builder.append((char)ch);
     }
     if (ch!=-1) {
@@ -572,11 +572,11 @@ public class HTMLStripCharFilter extends BaseCharFilter {
     //  <a href="a/<!--#echo "path"-->">
     private int readAttr2() throws IOException {
     if ((numRead - lastMark < safeReadAheadLimit)) {
-      int ch = read();
+      int ch = next();
       if (!isFirstIdChar(ch)) return MISMATCH;
-      ch = read();
+      ch = next();
       while(isIdChar(ch) && ((numRead - lastMark) < safeReadAheadLimit)){
-        ch=read();
+        ch = next();
       }
       if (isSpace(ch)) ch = nextSkipWS();
 

commit fdaa0920e2dceeb33e534138fe4a672914aff0ea
Author: Anders Melchiorsen <ma...@spoon.kalibalik.dk>
Date:   Fri Aug 28 16:31:34 2009 +0200

    Restore the numRead variable when rolling back the stream
    
    This fixes offset corrections after invalid HTML input, like
    "hi &<< <b>there</b>".

diff --git a/HTMLStripCharFilter.java b/HTMLStripCharFilter.java
index ab14de5..4bfa85b 100644
--- a/HTMLStripCharFilter.java
+++ b/HTMLStripCharFilter.java
@@ -171,6 +171,7 @@ public class HTMLStripCharFilter extends BaseCharFilter {
   private void restoreState() throws IOException {
     input.reset();
     pushed.setLength(0);
+    numRead = lastMark;
   }
 
   private int readNumericEntity() throws IOException {

commit 571537795af2edb54543db2f71550662b0a18e60
Author: Anders Melchiorsen <am...@gnu.jobsafari.dk>
Date:   Fri Aug 28 23:49:03 2009 +0200

    Update some tests.

diff --git a/HTMLStripCharFilterTest.java b/HTMLStripCharFilterTest.java
index 7be7c7e..4730830 100644
--- a/HTMLStripCharFilterTest.java
+++ b/HTMLStripCharFilterTest.java
@@ -49,9 +49,9 @@ public class HTMLStripCharFilterTest extends TestCase {
     String html = "<div class=\"foo\">this is some text</div> here is a <a href=\"#bar\">link</a> and " +
             "another <a href=\"http://lucene.apache.org/\">link</a>. " +
             "This is an entity: &amp; plus a &lt;.  Here is an &. <!-- is a comment -->";
-    String gold = "                 this is some text       here is a                link     and " +
-            "another                                     link    . " +
-            "This is an entity: &     plus a <   .  Here is an &.                      ";
+    String gold = " this is some text  here is a  link  and " +
+            "another  link . " +
+            "This is an entity: & plus a <.  Here is an &. ";
     HTMLStripCharFilter reader = new HTMLStripCharFilter(CharReader.get(new StringReader(html)));
     StringBuilder builder = new StringBuilder();
     int ch = -1;
@@ -87,7 +87,7 @@ public class HTMLStripCharFilterTest extends TestCase {
 
   public void testGamma() throws Exception {
     String test = "&Gamma;";
-    String gold = "\u0393      ";
+    String gold = "\u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -104,7 +104,7 @@ public class HTMLStripCharFilterTest extends TestCase {
 
   public void testEntities() throws Exception {
     String test = "&nbsp; &lt;foo&gt; &#61; &Gamma; bar &#x393;";
-    String gold = "       <   foo>    =     \u0393       bar \u0393     ";
+    String gold = "  <foo> = \u0393 bar \u0393";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -121,7 +121,7 @@ public class HTMLStripCharFilterTest extends TestCase {
 
   public void testMoreEntities() throws Exception {
     String test = "&nbsp; &lt;junk/&gt; &nbsp; &#33; &#64; and &#8217;";
-    String gold = "       <   junk/>           !     @     and ’      ";
+    String gold = "  <junk/>   ! @ and ’";
     Set<String> set = new HashSet<String>();
     set.add("reserved");
     Reader reader = new HTMLStripCharFilter(CharReader.get(new StringReader(test)), set);
@@ -264,7 +264,7 @@ public class HTMLStripCharFilterTest extends TestCase {
   public void testComment() throws Exception {
 
     String test = "<!--- three dashes, still a valid comment ---> ";
-    String gold = "                                               ";
+    String gold = "  ";
     Reader reader = new HTMLStripCharFilter(CharReader.get(new BufferedReader(new StringReader(test))));//force the use of BufferedReader
     int ch = 0;
     StringBuilder builder = new StringBuilder();

Re: HTML decoder is splitting tokens

Posted by Anders Melchiorsen <ma...@cup.kalibalik.dk>.
Hello.

Thanks for the hints. Still some trouble, though.

I added just the HTMLStripCharFilterFactory because, according to
the documentation, it should also replace HTML entities. It did, but still
left a space after the entity, so I got two tokens from "G&uuml;nther".
That seems like a bug?

Adding MappingCharFilterFactory in front of the HTML stripper (so that the
latter will not see the entity) does work as expected. That is, until I
try strings like "use &lt;p&gt; to mark a paragraph", where the HTML
stripper will then remove parts of the actual text. So this approach will
not work.


Finally, I was happy that I could now use an arbitrary tokenizer with HTML
input. The PatternTokenizer, however, seems to be using character offsets
corresponding to the output of the char filters, and so the highlighting
markers end up at the wrong place. Is that a bug, or a configuration
issue?
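
(As a point of reference, the kind of chain being described might look
roughly like this; a sketch only, with a made-up field type name and an
illustrative pattern value:)

<fieldType name="textHtmlPattern" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\W+"/>
  </analyzer>
</fieldType>

The tokenizer only ever sees the char-filtered text, so the offsets it
reports are offsets into that stripped stream; mapping them back to the
raw input for highlighting is exactly what the char filter's offset
correction is supposed to provide.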


Cheers,
Anders.


Koji Sekiguchi wrote:
> Hi Anders,
>
> Sorry, I don't know whether this is a bug or a feature, but
> I'd like to show an alternative approach if you'd like.
>
> In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
> marked as deprecated. Instead, you are encouraged to use
> HTMLStripCharFilterFactory with an arbitrary TokenizerFactory.
> And I'd recommend using MappingCharFilterFactory
> to convert character references to real characters.
> That is, you would have:
>
> <fieldType name="textHtml" class="solr.TextField" >
>   <analyzer>
>     <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
>     <charFilter class="solr.HTMLStripCharFilterFactory"/>
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>   </analyzer>
> </fieldType>
>
> where mapping.txt contains:
>
> "&uuml;" => "ü"
> "&auml;" => "ä"
> "&iuml;" => "ï"
> "&euml;" => "ë"
> "&ouml;" => "ö"
>     :             :
>
> Then run analysis.jsp and see the result.
>
> Thank you,
>
> Koji
>
>
> Anders Melchiorsen wrote:
>> Hi.
>>
>> When indexing the string "G&uuml;nther" with
>> HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two tokens,
>> "Gü" and "nther".
>>
>> Is this a bug, or am I doing something wrong?
>>
>> (Using a Solr nightly from 2009-05-29)
>>
>>
>> Anders.
>>
>>
>>
>>
>
>



Re: HTML decoder is splitting tokens

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Hi Anders,

Sorry, I don't know whether this is a bug or a feature, but
I'd like to show an alternative approach if you'd like.

In Solr trunk, HTMLStripWhitespaceTokenizerFactory is
marked as deprecated. Instead, you are encouraged to use
HTMLStripCharFilterFactory with an arbitrary TokenizerFactory.
And I'd recommend using MappingCharFilterFactory
to convert character references to real characters.
That is, you would have:

<fieldType name="textHtml" class="solr.TextField" >
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

where mapping.txt contains:

"&uuml;" => "ü"
"&auml;" => "ä"
"&iuml;" => "ï"
"&euml;" => "ë"
"&ouml;" => "ö"
    :             :

Then run analysis.jsp and see the result.

Thank you,

Koji


Anders Melchiorsen wrote:
> Hi.
>
> When indexing the string "G&uuml;nther" with
> HTMLStripWhitespaceTokenizerFactory (in analysis.jsp), I get two tokens,
> "Gü" and "nther".
>
> Is this a bug, or am I doing something wrong?
>
> (Using a Solr nightly from 2009-05-29)
>
>
> Anders.
>
>
>
>