Posted to solr-user@lucene.apache.org by dabboo <ag...@sapient.com> on 2009/03/17 12:31:57 UTC
Special Characters search in solr
Hi,
I am searching with query strings that contain special characters such as
è. For example, if I search for tèst, it should return all results that
contain tèst as well as test. There are other special characters too.
I have updated the server.xml file of my Tomcat server to include UTF-8 as the
encoding type in the server entry, but it is still not working.
Please suggest.
Thanks,
Amit Garg
--
View this message in context: http://www.nabble.com/Special-Characters-search-in-solr-tp22557230p22557230.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: Special Characters search in solr
Posted by Chris Hostetter <ho...@fucit.org>.
: Yes, I did and below is my debugQuery result.
before you even look at the debug section, look at the params section in
the responseHeader...
: <str name="q">Colo�</str>
the raw value Solr is getting from your servlet container doesn't match
what you think you are sending...
: It is actually converting "Coloèr" to "Colo�" and hence not searching. It is
...I'm guessing that either your servlet container is misconfigured for
dealing with UTF-8 characters, or your client code is doing something not
quite right ... until you get the value you expect to see coming back in
that responseHeader, there's no point in fiddling with your schema.
-Hoss
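For readers hitting the same symptom on Tomcat: one common cause is that the HTTP connector decodes the query string as ISO-8859-1 unless told otherwise (that was the default in the Tomcat releases current at the time of this thread). Below is a minimal sketch of a connector entry in server.xml. The port, protocol, timeout, and redirectPort values are placeholders rather than anything taken from this thread; URIEncoding is the only attribute that matters here.

```xml
<!-- conf/server.xml: hypothetical connector entry; keep your own port and other attributes -->
<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>
```

The client also has to send the query percent-encoded as UTF-8 (è becomes %C3%A8, so the request would carry q=Colo%C3%A8r). After a Tomcat restart, the q value echoed in the responseHeader should show the original character if this was the cause.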
Re: Special Characters search in solr
Posted by dabboo <ag...@sapient.com>.
Yes, I did and below is my debugQuery result.
<?xml version="1.0" encoding="UTF-8" ?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">47</int>
<lst name="params">
<str name="rows">10</str>
<str name="start">0</str>
<str name="indent">on</str>
<str name="q">Colo�</str>
<str name="qt">dismaxrequest</str>
<str name="debugQuery">true</str>
<str name="version">2.2</str>
</lst>
</lst>
<result name="response" numFound="0" start="0" maxScore="0.0" />
<lst name="debug">
<str name="rawquerystring">Colo�</str>
<str name="querystring">Colo�</str>
<str
name="parsedquery">+DisjunctionMaxQuery((programJacketImage_program_s:colo |
courseCodeSeq_course_s:colo | authorLastName_product_s:colo |
era_product_s:colo | Index_Type_s:colo | prdMainTitle_s:colo |
discCode_course_s:colo | sourceGroupName_course_s:colo |
indexType_course_s:colo | prdMainTitle_product_s:colo |
isbn10_product_s:colo | displayName_course_s:colo | groupNm_program_s:colo |
discipline_product_s:colo | courseJacketImage_course_s:colo |
imprint_product_s:colo | introText_program_s:colo |
productType_product_s:colo | isbn13_product_s:colo |
copyrightYear_product_s:colo | prdPubDate_product_s:colo |
programType_program_s:colo | editor_product_s:colo |
courseType_course_s:colo | courseId_course_s:colo |
categoryIds_product_s:colo | contentType_product_s:colo |
indexType_program_s:colo | strapline_product_s:colo |
subCompany_course_s:colo | aluminator_product_s:colo | readBy_product_s:colo
| subject_product_s:colo | edition_product_s:colo | IndexId_s:colo |
programId_program_s:colo)~0.01) () all:english^90.0 all:hindi^123.0
all:glorious^2000.0 all:highlight^1.0E7 all:math^100.0 all:ab^12.0
all:erer^4545.0</str>
<str name="parsedquery_toString">+(programJacketImage_program_s:colo |
courseCodeSeq_course_s:colo | authorLastName_product_s:colo |
era_product_s:colo | Index_Type_s:colo | prdMainTitle_s:colo |
discCode_course_s:colo | sourceGroupName_course_s:colo |
indexType_course_s:colo | prdMainTitle_product_s:colo |
isbn10_product_s:colo | displayName_course_s:colo | groupNm_program_s:colo |
discipline_product_s:colo | courseJacketImage_course_s:colo |
imprint_product_s:colo | introText_program_s:colo |
productType_product_s:colo | isbn13_product_s:colo |
copyrightYear_product_s:colo | prdPubDate_product_s:colo |
programType_program_s:colo | editor_product_s:colo |
courseType_course_s:colo | courseId_course_s:colo |
categoryIds_product_s:colo | contentType_product_s:colo |
indexType_program_s:colo | strapline_product_s:colo |
subCompany_course_s:colo | aluminator_product_s:colo | readBy_product_s:colo
| subject_product_s:colo | edition_product_s:colo | IndexId_s:colo |
programId_program_s:colo)~0.01 () all:english^90.0 all:hindi^123.0
all:glorious^2000.0 all:highlight^1.0E7 all:math^100.0 all:ab^12.0
all:erer^4545.0</str>
<lst name="explain" />
<str name="QParser">DismaxQParser</str>
It is actually converting "Coloèr" to "Colo�", and hence the search returns
nothing. It behaved the same way even before I added the ISOLatin1AccentFilter.
Please suggest.
Thanks,
Amit Garg
Erick Erickson wrote:
>
> Did you reindex after you incorporated the ISOLatin... filter?
Re: Special Characters search in solr
Posted by Erick Erickson <er...@gmail.com>.
Did you reindex after you incorporated the ISOLatin... filter?
On Tue, Mar 17, 2009 at 8:40 AM, dabboo <ag...@sapient.com> wrote:
>
> This is the entry in schema.xml
Re: Special Characters search in solr
Posted by dabboo <ag...@sapient.com>.
This is the entry in schema.xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
           omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!--tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory" /-->
    <!-- in this example, we will only use synonyms at query time
    <filter class="solr.SynonymFilterFactory"
            synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    -->
    <!-- Case insensitive stop word removal.
         enablePositionIncrements=true ensures that a 'gap' is left to
         allow for accurate phrase queries.
    -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            />
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <analyzer class="org.apache.lucene.analysis.ru.RussianAnalyzer"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPorterFilterFactory"
            protected="protwords.txt"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <!--analyzer
        class="org.apache.lucene.analysis.ru.RussianAnalyzer"/-->
    <filter class="solr.ShingleFilterFactory" outputUnigrams="true"
            outputUnigramIfNoNgram="true" maxShingleSize="99"/>
  </analyzer>
</fieldType>
Re: Special Characters search in solr
Posted by dabboo <ag...@sapient.com>.
I have also added this filter factory to my schema.xml, but it is still not
working. I am sorry, but I did not understand how to create the field to
handle the accents.
Please help.
Grant Ingersoll-6 wrote:
>
> You will need to create a field that handles the accents in order to
> do this. Start by looking at the ISOLatin1AccentFilter.
>
> -Grant
Re: Special Characters search in solr
Posted by Grant Ingersoll <gs...@apache.org>.
You will need to create a field that handles the accents in order to
do this. Start by looking at the ISOLatin1AccentFilter.
-Grant
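A minimal sketch of what such a field could look like in schema.xml, assuming a Solr version that still ships ISOLatin1AccentFilterFactory. The names text_noaccent and title_noaccent are made up for illustration and do not come from this thread.

```xml
<!-- Hypothetical field type: a single <analyzer> with no type attribute
     is applied at both index time and query time -->
<fieldType name="text_noaccent" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- folds accented Latin-1 characters, e.g. è to e,
         so tèst and test produce the same token -->
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that uses the type above -->
<field name="title_noaccent" type="text_noaccent" indexed="true" stored="true"/>
```

Because the filter changes the tokens written to the index, documents indexed before the change have to be reindexed before accented and unaccented queries match the same results.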