Posted to solr-user@lucene.apache.org by It-forum <it...@meseo.fr> on 2013/05/22 18:09:10 UTC

Solr french search optimisation

Hello to all,

I'm trying to set up Solr 4.2 to index and search French content.

I defined a dedicated field type for French content:

    <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
        <analyzer type="index">
            <charFilter class="solr.MappingCharFilterFactory"
                        mapping="mapping-ISOLatin1Accent.txt"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="1" catenateNumbers="1"
                    catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory"
                    language="French" protected="protwords.txt"/>
        </analyzer>

        <analyzer type="query">
            <charFilter class="solr.MappingCharFilterFactory"
                        mapping="mapping-ISOLatin1Accent.txt"/>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.WordDelimiterFilterFactory"
                    generateWordParts="1" generateNumberParts="1"
                    catenateWords="0" catenateNumbers="0"
                    catenateAll="0" splitOnCaseChange="1"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.SnowballPorterFilterFactory"
                    language="French" protected="protwords.txt"/>
        </analyzer>
    </fieldType>


Unfortunately, this field does not behave as I would like.

I'd like to be able to get results for misspelled words.

E.g.: I would like to get the same results when typing "Pompe à chaleur"
as when typing "pomppe a chaler", or for "solère" and "solaire".

I can't find the right way to define a field type that achieves this.

Thanks in advance for your help; do not hesitate to ask for more
information if needed.

Regards

David


Re: Solr french search optimisation

Posted by fbrisbart <fb...@bestofmedia.com>.
You could also consider using a SynonymFilter if you can list the
misspelled words.

That's a quick and dirty solution, but it's easier to add a
"pomppe -> pompe" entry to a synonym list than to tune a phonetic filter.
NB: reindexing is required whenever the synonyms file changes, at least
when the filter is applied at index time.

Franck Brisbart
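
As a rough sketch of the synonym approach above: assuming a hand-maintained
file called misspellings.txt (a hypothetical name) in the core's conf
directory, a filter along these lines could be added to the query analyzer
of text_fr, after the LowerCaseFilterFactory and before the stemmer. The
file name and the sample mappings are illustrative assumptions, not part of
the original configuration.

```xml
<!-- Sketch only: "misspellings.txt" is a hypothetical file name.
     Example file content, one explicit mapping per line; entries are
     written lowercase and without accents because the char filter and
     lowercase filter have already run at this point in the chain:

       pomppe => pompe
       chaler => chaleur
       solere => solaire
-->
<filter class="solr.SynonymFilterFactory"
        synonyms="misspellings.txt"
        ignoreCase="true"
        expand="false"/>
```

If the filter sits only in the query analyzer, a change to the file should
need just a core reload rather than a full reindex; the trade-off is that
every misspelling has to be listed by hand.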




Re: Solr french search optimisation

Posted by Cristian Cascetta <cr...@liquida.it>.
> Could you clarify a few more things:
>
> - Should SpellCheckComponent and the phonetic filter be used while
> indexing, or only while querying?
>

SpellCheck: you can define a dedicated field for spellchecking (in that
sense it involves both the schema and query time), or you can build a
separate vocabulary for spell-checking. I strongly suggest going through
the documentation at http://wiki.apache.org/solr/SpellCheckComponent for
this component; every time I have used it, I have needed to customize and
adapt the configuration.
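
To make the "dedicated field" option a bit more concrete, here is a minimal
sketch. The names text_spell, spell and title are assumptions made for
illustration; DirectSolrSpellChecker is one of the spellchecker
implementations shipped with Solr 4.x and reads terms straight from the
main index.

```xml
<!-- schema.xml: "text_spell", "spell" and "title" are hypothetical names.
     The spell field is deliberately not stemmed, so suggestions come back
     as real words rather than stems. -->
<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="spell" type="text_spell" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="spell"/>

<!-- solrconfig.xml: spellchecker built on the "spell" field above -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_spell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <!-- maximum edit distance between the typed word and a suggestion -->
    <int name="maxEdits">2</int>
    <!-- number of leading characters that must match exactly -->
    <int name="minPrefix">1</int>
  </lst>
</searchComponent>
```

The values shown are starting points only and would need tuning against
the real content.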


>
> - Does the spellcheck component only return the correct spelling, or is
> it also used to run the search and return results?
>

I'm not sure, so please check the documentation, but I remember that you
can configure it to directly re-execute the spell-corrected query AND show
some alternatives/suggestions to the user (obviously this is a
display/frontend choice).
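
A hedged sketch of how suggestions and a corrected query could be requested
by default, assuming a SpellCheckComponent is already registered under the
name spellcheck with a dictionary called default (both names are
assumptions, as is the /select handler path):

```xml
<!-- solrconfig.xml sketch: handler path and component name are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <!-- number of suggestions returned per misspelled term -->
    <str name="spellcheck.count">5</str>
    <!-- also return a whole corrected query string (a "collation") -->
    <str name="spellcheck.collate">true</str>
    <!-- test up to 5 candidate collations against the index -->
    <str name="spellcheck.maxCollationTries">5</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

As far as I know, the response then carries the suggestions and the
collation next to the normal results, and it is up to the client to
re-send the collation as a new query or to display it as a "did you mean"
hint.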


>
> - If I want to handle spelling, phonetic and stemming problems in
> French, can I use a single field, or should I use several fields with
> different filters?
>


I don't think it's possible to use only one field. In my experience it is
better to use multiple fields for multiple purposes. If you're worried
about index size, remember that fields that are indexed and NOT stored
don't grow your index that much. Mark as stored only the fields you need
to display to the end user.
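
The multi-field layout could look roughly like this. The field names and
the types text_fr_phonetic and text_spell are hypothetical; only text_fr
comes from the original post.

```xml
<!-- schema.xml sketch: one stored source field, two index-only copies.
     "content", "content_phonetic", "content_spell", "text_fr_phonetic"
     and "text_spell" are hypothetical names. -->
<field name="content"          type="text_fr"          indexed="true" stored="true"/>
<field name="content_phonetic" type="text_fr_phonetic" indexed="true" stored="false"/>
<field name="content_spell"    type="text_spell"       indexed="true" stored="false"/>

<!-- copyField passes the raw input text to each destination, which then
     analyzes it with its own field type -->
<copyField source="content" dest="content_phonetic"/>
<copyField source="content" dest="content_spell"/>
```

With the edismax parser, a query could then search several fields at once
and favour exact stemmed matches, for example with
qf=content^3 content_phonetic.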

Re: Solr french search optimisation

Posted by It-forum <it...@meseo.fr>.
Hello,

Thanks, Cristian, for the details.

I totally agree with your explanation; these are different aspects that
I need to solve.

Could you clarify a few more things:

- Should SpellCheckComponent and the phonetic filter be used while
indexing, or only while querying?

- Does the spellcheck component only return the correct spelling, or is
it also used to run the search and return results?

- If I want to handle spelling, phonetic and stemming problems in
French, can I use a single field, or should I use several fields with
different filters?

Regards

David




Re: Solr french search optimisation

Posted by Cristian Cascetta <cr...@liquida.it>.
Hello,

I think you're mixing up three different things:

1) Schema and field definitions govern precision/recall: analyzing a
field differently changes which results match and how they are ranked.
2) The "pomppe a chaler" problem is more of a spellchecking problem:
http://wiki.apache.org/solr/SpellCheckComponent
3) "solère" vs. "solaire" is a phonetic search problem:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory

Hope this helps a little,

cristian
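
For the phonetic case in point 3, a field type along these lines might be
a starting point. The name text_fr_phonetic is hypothetical, and
DoubleMetaphone is only one of the encoders PhoneticFilterFactory accepts;
it was designed mainly with English in mind, so its behaviour on French
words should be checked on the Solr admin Analysis page before relying on
it.

```xml
<!-- Sketch only: "text_fr_phonetic" is a hypothetical type name -->
<fieldType name="text_fr_phonetic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="false" replaces each token with its phonetic code;
         inject="true" would index the code alongside the original token -->
    <filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index and query time, two spellings
match whenever they produce the same code, at the cost of more false
positives than an exact or stemmed field.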



Re: Solr french search optimisation

Posted by It-forum <it...@meseo.fr>.
Hello again,

Could anyone help me, please?

David
