Posted to solr-user@lucene.apache.org by dabboo <ag...@sapient.com> on 2009/03/17 12:31:57 UTC

Special Characters search in solr

Hi,

I am searching with query strings that contain special characters such as
è. For example, if I search for tèst, it should return all the results
that contain tèst as well as test. There are other special characters too.

I have updated the server.xml file of my Tomcat server and set UTF-8 as
the encoding in the server entry, but it is still not working.

Please suggest.

Thanks,
Amit Garg
-- 
View this message in context: http://www.nabble.com/Special-Characters-search-in-solr-tp22557230p22557230.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Special Characters search in solr

Posted by Chris Hostetter <ho...@fucit.org>.

: Yes, I did and below is my debugQuery result.

before you even look at the debug section, look at the params section in 
the responseHeader...

:   <str name="q">Colo�</str> 

the raw value Solr is getting from your servlet container doesn't match 
what you think you are sending...

: It is actually converting "Coloèr" to "Colo�" and hence not searching. It is

...i'm guessing that either your servlet container is misconfigured for 
dealing with UTF-8 characters, or your client code is doing something not 
quite right ... until you get the value you expect to see coming back in 
that responseHeader, there's no point in fiddling with your schema.


-Hoss
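
For the servlet-container side of this, one setting commonly checked on Tomcat is the URIEncoding attribute on the HTTP connector in server.xml. The fragment below is only a sketch: the port, protocol, timeout and redirect values are placeholder assumptions, not taken from this thread, and URIEncoding="UTF-8" is the only part that matters here.

```xml
<!-- Sketch of a Tomcat server.xml connector; port and timeouts are assumed placeholders -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

With this in place, a client sending the query as UTF-8 would percent-encode è as %C3%A8 (q=Colo%C3%A8r). If the client instead sends the ISO-8859-1 byte %E8, the container cannot decode it as UTF-8, and the q value in the responseHeader is likely to come back mangled.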

Re: Special Characters search in solr

Posted by dabboo <ag...@sapient.com>.

Yes, I did and below is my debugQuery result.

<?xml version="1.0" encoding="UTF-8" ?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">47</int>
  <lst name="params">
  <str name="rows">10</str>
  <str name="start">0</str>
  <str name="indent">on</str>
  <str name="q">Colo�</str>
  <str name="qt">dismaxrequest</str>
  <str name="debugQuery">true</str>
  <str name="version">2.2</str>
  </lst>
  </lst>
  <result name="response" numFound="0" start="0" maxScore="0.0" />
<lst name="debug">
  <str name="rawquerystring">Colo�</str>
  <str name="querystring">Colo�</str>
  <str
name="parsedquery">+DisjunctionMaxQuery((programJacketImage_program_s:colo |
courseCodeSeq_course_s:colo | authorLastName_product_s:colo |
era_product_s:colo | Index_Type_s:colo | prdMainTitle_s:colo |
discCode_course_s:colo | sourceGroupName_course_s:colo |
indexType_course_s:colo | prdMainTitle_product_s:colo |
isbn10_product_s:colo | displayName_course_s:colo | groupNm_program_s:colo |
discipline_product_s:colo | courseJacketImage_course_s:colo |
imprint_product_s:colo | introText_program_s:colo |
productType_product_s:colo | isbn13_product_s:colo |
copyrightYear_product_s:colo | prdPubDate_product_s:colo |
programType_program_s:colo | editor_product_s:colo |
courseType_course_s:colo | courseId_course_s:colo |
categoryIds_product_s:colo | contentType_product_s:colo |
indexType_program_s:colo | strapline_product_s:colo |
subCompany_course_s:colo | aluminator_product_s:colo | readBy_product_s:colo
| subject_product_s:colo | edition_product_s:colo | IndexId_s:colo |
programId_program_s:colo)~0.01) () all:english^90.0 all:hindi^123.0
all:glorious^2000.0 all:highlight^1.0E7 all:math^100.0 all:ab^12.0
all:erer^4545.0</str> 
  <str name="parsedquery_toString">+(programJacketImage_program_s:colo |
courseCodeSeq_course_s:colo | authorLastName_product_s:colo |
era_product_s:colo | Index_Type_s:colo | prdMainTitle_s:colo |
discCode_course_s:colo | sourceGroupName_course_s:colo |
indexType_course_s:colo | prdMainTitle_product_s:colo |
isbn10_product_s:colo | displayName_course_s:colo | groupNm_program_s:colo |
discipline_product_s:colo | courseJacketImage_course_s:colo |
imprint_product_s:colo | introText_program_s:colo |
productType_product_s:colo | isbn13_product_s:colo |
copyrightYear_product_s:colo | prdPubDate_product_s:colo |
programType_program_s:colo | editor_product_s:colo |
courseType_course_s:colo | courseId_course_s:colo |
categoryIds_product_s:colo | contentType_product_s:colo |
indexType_program_s:colo | strapline_product_s:colo |
subCompany_course_s:colo | aluminator_product_s:colo | readBy_product_s:colo
| subject_product_s:colo | edition_product_s:colo | IndexId_s:colo |
programId_program_s:colo)~0.01 () all:english^90.0 all:hindi^123.0
all:glorious^2000.0 all:highlight^1.0E7 all:math^100.0 all:ab^12.0
all:erer^4545.0</str> 
  <lst name="explain" /> 
  <str name="QParser">DismaxQParser</str> 


It is actually converting "Coloèr" to "Colo�" and hence not searching. It
behaved the same way even before I added the ISOLatin1AccentFilter.

Please suggest.

Thanks,
Amit Garg

Erick Erickson wrote:
> 
> Did you reindex after you incorporated the ISOLatin... filter?
> 


Re: Special Characters search in solr

Posted by Erick Erickson <er...@gmail.com>.

Did you reindex after you incorporated the ISOLatin... filter?


Re: Special Characters search in solr

Posted by dabboo <ag...@sapient.com>.

This is the entry in schema.xml

    <fieldType name="text" class="solr.TextField" positionIncrementGap="100"
               omitNorms="true">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!--tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory" /-->
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory"
                synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
             enablePositionIncrements=true ensures that a 'gap' is left to
             allow for accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="1"
                catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.ISOLatin1AccentFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true"
                words="stopwords.txt"/>
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1" generateNumberParts="1" catenateWords="0"
                catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.EnglishPorterFilterFactory"
                protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <!--analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/-->
        <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
                outputUnigramIfNoNgram="true" maxShingleSize="99"/>
      </analyzer>
    </fieldType>





Re: Special Characters search in solr

Posted by dabboo <ag...@sapient.com>.

I have added this filter factory to my schema.xml as well, but it is still
not working. I am sorry, but I did not understand how to create a field that
handles the accents.

Please help.




Re: Special Characters search in solr

Posted by Grant Ingersoll <gs...@apache.org>.

You will need to create a field that handles the accents in order to  
do this.  Start by looking at the ISOLatin1AccentFilter.

-Grant
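
As a rough illustration of such a field, the schema.xml fragment below defines a small accent-folding type and copies an existing field into it. The names text_noaccent and title_noaccent are hypothetical, and prdMainTitle_s is borrowed from the debug output elsewhere in this thread purely as an example source. The filter chain is a minimal assumption, not the poster's actual configuration.

```xml
<!-- Hypothetical field type: folds ISO Latin-1 accents, then lowercases -->
<fieldType name="text_noaccent" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain is applied at index and query time -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical searchable field using that type -->
<field name="title_noaccent" type="text_noaccent" indexed="true" stored="false"/>

<!-- Copy an existing field into it so the accent-folded form gets indexed -->
<copyField source="prdMainTitle_s" dest="title_noaccent"/>
```

Because the filter runs at both index and query time, tèst and test should reduce to the same term. Documents still need to be reindexed after the schema change, and the new field has to be among the fields actually searched (for example in the dismax qf list).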
