Posted to solr-user@lucene.apache.org by "Kundig, Andreas" <an...@wipo.int> on 2009/11/18 12:18:09 UTC

HTMLStripCharFilterFactory does not replace é

Hello

I indexed an HTML document that uses decimal HTML entity encoding: the character é (e with an acute accent) is encoded as &#233;. The exact content of the document is:

<html><body>&#231;a va m&#233;m&#233; ?</body></html>

A search for 'mémé' returns no document. If I paste the line above into the Solr admin's analysis.jsp, it doesn't match mémé either. There is only a match if I replace &#233; with é.

This is how I configured the fieldType:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
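
For reference, here is a minimal Python sketch of the output this analyzer chain is expected to produce for the document above: tags removed, numeric entities decoded, then a whitespace split. It is not Solr's implementation. The function name strip_html_and_tokenize is hypothetical, and the regex-based tag removal is a simplification that only suits small, well-formed snippets like this one.

```python
import html
import re


def strip_html_and_tokenize(text):
    """Rough stand-in for HTMLStripCharFilter + WhitespaceTokenizer (illustration only)."""
    # Replace every tag with a space so adjacent words do not run together.
    without_tags = re.sub(r"<[^>]+>", " ", text)
    # Decode character references such as &#233; into the characters they denote.
    decoded = html.unescape(without_tags)
    # Split on runs of whitespace, like a whitespace tokenizer would.
    return decoded.split()


tokens = strip_html_and_tokenize(
    "<html><body>&#231;a va m&#233;m&#233; ?</body></html>"
)
print(tokens)  # ['ça', 'va', 'mémé', '?']
```

If the real analyzer output still shows &#233; instead of é, the char filter is probably not being applied at all.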

I tried avoiding the problem by using the MappingCharFilterFactory:

<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>

I put the file mapping.txt in the conf directory. It contains just this:

"&#233;" => "é"

This doesn't work either. How can I get this to work?
(I am using Solr 1.4.0.)
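
As an illustration of what the mapping rule above is meant to do, the following Python sketch parses a line in the "source" => "target" form and applies it as a plain string replacement. This is an assumption-laden simplification: the helper names parse_mapping_rule and apply_mappings are hypothetical, and the sketch ignores escape sequences and the matching strategy of the real MappingCharFilter.

```python
import re


def parse_mapping_rule(line):
    """Parse one rule of the form "source" => "target" (simplified, no escapes)."""
    match = re.fullmatch(r'\s*"(.*?)"\s*=>\s*"(.*?)"\s*', line)
    if match is None:
        raise ValueError(f"not a mapping rule: {line!r}")
    return match.group(1), match.group(2)


def apply_mappings(text, rules):
    """Apply each (source, target) rule in order as a literal replacement."""
    for source, target in rules:
        text = text.replace(source, target)
    return text


rules = [parse_mapping_rule('"&#233;" => "é"')]
mapped = apply_mappings("m&#233;m&#233;", rules)
print(mapped)  # mémé
```

Under these assumptions the rule itself is well-formed, so a failure to match is more likely caused by the filter not being loaded than by the content of mapping.txt.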

thank you
Andréas Kündig

World Intellectual Property Organization Disclaimer:

This electronic message may contain privileged, confidential and
copyright protected information. If you have received this e-mail
by mistake, please immediately notify the sender and delete this
e-mail and all its attachments. Please ensure all e-mail attachments
are scanned for viruses prior to opening or using.

RE: HTMLStripCharFilterFactory does not replace é

Posted by "Kundig, Andreas" <an...@wipo.int>.
It now works for me too. The problem was that Tomcat was still running with an older version of the configuration. HTMLStripCharFilterFactory didn't even appear in analysis.jsp.

Thank you for looking into this.

Andréas

-----Original Message-----
From: Koji Sekiguchi [mailto:koji@r.email.ne.jp]
Sent: Thursday, 19 November 2009 06:59
To: solr-user@lucene.apache.org
Subject: Re: HTMLStripCharFilterFactory does not replace é

Your first definition of text_fr seems to be correct and should work
as expected. I tested it and it worked fine ("mémé" was highlighted).

What was the output of HTMLStripCharFilterFactory in analysis.jsp?
In my analysis.jsp, I got "ça va mémé ?".

Koji


Re: HTMLStripCharFilterFactory does not replace é

Posted by Koji Sekiguchi <ko...@r.email.ne.jp>.
Your first definition of text_fr seems to be correct and should work
as expected. I tested it and it worked fine ("mémé" was highlighted).

What was the output of HTMLStripCharFilterFactory in analysis.jsp?
In my analysis.jsp, I got "ça va mémé ?".

Koji


-- 
http://www.rondhuit.com/en/