Posted to solr-user@lucene.apache.org by Tejinder Rawat <te...@gmail.com> on 2012/02/01 20:52:41 UTC

How to make search with special characters in keywords

Hi all,

In my implementation, many document fields contain words with
special characters, such as "Company®" and "Time™".

The index is built from these fields. However, searching for
these keywords in the Solr console does not work.

For example, entering "Company®" or "Time™" in the search box does not
return any documents, whereas entering "Company" or "Time" does.

The requirement is to be able to search with special characters in keywords.

Any pointers on how to index and search with special
characters would be greatly appreciated.  Thank you.


Thanks,
Tejinder

Re: How to make search with special characters in keywords

Posted by SUJIT PAL <su...@comcast.net>.
Well, sometimes people just copy-paste stuff into the search box probably because some words (at least in my world) are very hard to spell correctly. We noticed the problem because the query was getting mangled on its way in and not returning any search results even though it should have.

Our analysis chain (both query and index) uses ASCIIFoldingFilter to fold these special characters to their "equivalent" ASCII, so a string such as "Ångström" actually results in a search for "angstrom". Indexing applies the same conversion.
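
As a rough illustration of what this kind of folding does, here is a small Python sketch. It is only an approximation built on Unicode decomposition, not Lucene's ASCIIFoldingFilter, which uses its own mapping table. The name fold_to_ascii is made up for this example, and the lowercasing stands in for whatever lowercase step the analysis chain has.

```python
import unicodedata

def fold_to_ascii(text):
    """Approximate ASCII folding: decompose, drop combining marks, lowercase."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep only the base characters; combining marks have a non-zero class.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Assumption: lowercasing happens somewhere in the same analysis chain.
    return stripped.lower()

print(fold_to_ascii("Ångström"))  # angstrom
```

Because the same function would run at index time and at query time, both "Ångström" and "angstrom" end up as the same term.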

The mangling looked very similar to what happens when UTF-8 is passed through ISO-8859-1 encoding (and vice versa) which led us to the solution.
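
A minimal Python sketch of that kind of mangling, in both directions. The helper names are hypothetical and nothing here is Solr- or Tomcat-specific; it only shows what the strings look like when the two encodings get mixed up.

```python
def utf8_read_as_latin1(text):
    """Encode as UTF-8, then wrongly decode the bytes as ISO-8859-1."""
    return text.encode("utf-8").decode("iso-8859-1")

def latin1_read_as_utf8(text):
    """Encode as ISO-8859-1, then wrongly decode the bytes as UTF-8."""
    # Invalid byte sequences are replaced with U+FFFD instead of raising.
    return text.encode("iso-8859-1").decode("utf-8", errors="replace")

def repair(text):
    """Undo utf8_read_as_latin1, assuming no bytes were lost on the way."""
    return text.encode("iso-8859-1").decode("utf-8")

broken = utf8_read_as_latin1("Company®")
print(broken)                           # CompanyÂ®
print(repair(broken))                   # Company®
print(latin1_read_as_utf8("Company®"))  # Company followed by U+FFFD
```

A stray "Â" or "Ã" in front of the special character is the usual sign of the first case.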

-sujit

On Feb 1, 2012, at 5:04 PM, Erick Erickson wrote:

> Sujit's comments are well taken, part of your problem will certainly be
> getting the special characters through your container...
> 
> But another part of your problem will be having the characters in
> your index in the first place. The fact that you can find "Time" at
> all suggests that your index does NOT contain the special
> characters. You need to look at your analysis chain to see
> what transformations occur; see the admin/analysis page...
> 
> But I question why you need to search on special characters. Do
> you really expect the user to be happy with being required to
> enter "Company®"? A common approach is to remove such
> special characters during both index and query analysis so that
> "Company®" and "Company" are equivalent.
> 
> But your problem space may differ.
> 
> Best
> Erick


Re: How to make search with special characters in keywords

Posted by Erick Erickson <er...@gmail.com>.
Sujit's comments are well taken, part of your problem will certainly be
getting the special characters through your container...

But another part of your problem will be having the characters in
your index in the first place. The fact that you can find "Time" at
all suggests that your index does NOT contain the special
characters. You need to look at your analysis chain to see
what transformations occur; see the admin/analysis page...

But I question why you need to search on special characters. Do
you really expect the user to be happy with being required to
enter "Company®"? A common approach is to remove such
special characters during both index and query analysis so that
"Company®" and "Company" are equivalent.
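
To make the idea concrete, here is a toy Python sketch of applying the same normalization on the index side and the query side. It is not how Solr does it internally (in Solr this would be configured in the field type's analysis chain); the function names, the sample documents, and the choice to drop every Unicode symbol character are all assumptions for illustration.

```python
import unicodedata
from collections import defaultdict

def normalize_token(token):
    """Lowercase a token and drop symbol characters such as ® and ™."""
    # Unicode general categories starting with "S" are symbols (® and ™ are "So").
    kept = [ch for ch in token if not unicodedata.category(ch).startswith("S")]
    return "".join(kept).lower()

def build_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            key = normalize_token(token)
            if key:  # skip tokens that were nothing but symbols
                index[key].add(doc_id)
    return index

def search(index, query):
    """Normalize the query the same way the documents were normalized."""
    return index.get(normalize_token(query), set())

docs = {1: "Company® annual report", 2: "Time™ magazine"}
index = build_index(docs)
print(search(index, "Company®"))  # {1}
print(search(index, "company"))   # {1}
```

The point is only that the query and the indexed text pass through the same function, so the user can type the symbol or leave it out.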

But your problem space may differ.

Best
Erick

On Wed, Feb 1, 2012 at 6:55 PM, SUJIT PAL <su...@comcast.net> wrote:
> Hi Tejinder,
>
> I had this problem yesterday (believe it or not :-)), and the fix for us was to make Tomcat UTF-8 compliant. In server.xml there is a <Connector> element; we added the attribute URIEncoding="UTF-8" to it and restarted Tomcat. I'm not sure what container you are using. If it's Tomcat, this should solve it; otherwise you can probably find a similar setting for your container. Here is a link that provides more specific info:
> http://struts.apache.org/2.0.6/docs/how-to-support-utf-8-uriencoding-with-tomcat.html
>
> -sujit

Re: How to make search with special characters in keywords

Posted by SUJIT PAL <su...@comcast.net>.
Hi Tejinder,

I had this problem yesterday (believe it or not :-)), and the fix for us was to make Tomcat UTF-8 compliant. In server.xml there is a <Connector> element; we added the attribute URIEncoding="UTF-8" to it and restarted Tomcat. I'm not sure what container you are using. If it's Tomcat, this should solve it; otherwise you can probably find a similar setting for your container. Here is a link that provides more specific info:
http://struts.apache.org/2.0.6/docs/how-to-support-utf-8-uriencoding-with-tomcat.html
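
To see why the URI encoding setting matters, here is a small Python sketch of what happens to the query string. It only imitates the decoding step with the standard library and is not Tomcat code. It assumes the browser percent-encodes the query as UTF-8 and that the container decodes it as ISO-8859-1 unless told otherwise, which, as far as I know, was the default in older Tomcat versions.

```python
from urllib.parse import quote, unquote

def encode_query(text):
    """Percent-encode the UTF-8 bytes of the text, as a browser typically does."""
    return quote(text)

def decode_query(raw, charset):
    """Decode a percent-encoded query string using the given charset."""
    return unquote(raw, encoding=charset)

raw = encode_query("Company®")
print(raw)                              # Company%C2%AE
print(decode_query(raw, "iso-8859-1"))  # CompanyÂ®  (what Solr would be asked for)
print(decode_query(raw, "utf-8"))       # Company®   (with URIEncoding="UTF-8")
```

With the wrong charset, the query term that reaches Solr is not the one the user typed, so it matches nothing.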

-sujit

On Feb 1, 2012, at 11:52 AM, Tejinder Rawat wrote:

> Hi all,
> 
> In my implementation, many document fields contain words with
> special characters, such as "Company®" and "Time™".
> 
> The index is built from these fields. However, searching for
> these keywords in the Solr console does not work.
> 
> For example, entering "Company®" or "Time™" in the search box does not
> return any documents, whereas entering "Company" or "Time" does.
> 
> The requirement is to be able to search with special characters in keywords.
> 
> Any pointers on how to index and search with special
> characters would be greatly appreciated.  Thank you.
> 
> 
> Thanks,
> Tejinder