Posted to solr-user@lucene.apache.org by It-forum <it...@meseo.fr> on 2013/05/22 18:09:10 UTC
Solr french search optimisation
Hello to all,
I'm trying to set up Solr 4.2 to index and search French content.
I defined a dedicated field type for French content:
<fieldType name="text_fr" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="1"
            catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="French" protected="protwords.txt"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1" catenateWords="0"
            catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="French" protected="protwords.txt"/>
  </analyzer>
</fieldType>
Unfortunately, this field does not behave as I would like.
I'd like to get results even when a word is misspelled.
E.g., I'd like to get the same results when typing "pomppe a chaler" as when
typing "Pompe à chaleur", and likewise for "solère" and "solaire".
I can't find the right way to define a field type that achieves this.
Thanks in advance for your help; do not hesitate to ask for more information
if needed.
Regards
David
Re: Solr french search optimisation
Posted by fbrisbart <fb...@bestofmedia.com>.
You can also think about using a SynonymFilter if you can list the
misspelled words.
That's a quick and dirty solution.
But it's easier to add a "pomppe => pompe" entry to a synonym list than to
tune a phonetic filter.
NB: reindexing is required whenever the synonyms file changes.
Franck Brisbart
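The synonym approach suggested above could look roughly like the sketch below. It is only an illustration: the file name `synonyms_fr.txt` and the listed misspellings are assumptions, not part of the original setup. The filter is placed in the query analyzer of the `text_fr` type from the original post, after lowercasing and before stemming.

```xml
<!-- Hypothetical contents of conf/synonyms_fr.txt, one mapping per line:
       pomppe => pompe
       chaler => chaleur
       solere => solaire
     The entries carry no accents because the MappingCharFilterFactory below
     strips them before tokens reach the synonym filter. -->
<analyzer type="query">
  <charFilter class="solr.MappingCharFilterFactory"
              mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterFilterFactory"
          generateWordParts="1" generateNumberParts="1" catenateWords="0"
          catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Rewrite known misspellings before stemming (file name is an assumption) -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms_fr.txt"
          ignoreCase="true" expand="false"/>
  <filter class="solr.SnowballPorterFilterFactory"
          language="French" protected="protwords.txt"/>
</analyzer>
```

When the filter sits only in the query analyzer, as in this sketch, a change to the file should only need a core reload. Reindexing becomes necessary when the same filter is also applied in the index analyzer.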
On Thursday, 23 May 2013 at 08:59 +0200, Cristian Cascetta wrote:
> Hello,
>
> I think you're confusing three different things:
>
> 1) schema and field definitions affect precision/recall: treating a
> field differently changes the search results and their ranking
> 2) the "pomppe a chaler" problem is more a spellchecking problem
> http://wiki.apache.org/solr/SpellCheckComponent
> 3) "solère" and "solaire" is a phonetic search problem
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory
>
> Hope this helps a little,
>
> cristian
Re: Solr french search optimisation
Posted by Cristian Cascetta <cr...@liquida.it>.
> Could you clarify a few more things:
>
> - Should SpellCheckComponent and phonetic filters be used while indexing
> or only while querying?
>
SpellCheck: you can define a dedicated field for spellchecking (in that
sense it touches both the schema and query time), or you can build the
spellchecker from a separate vocabulary. I strongly suggest reading the
documentation for this component at
http://wiki.apache.org/solr/SpellCheckComponent; every time I have used it,
I have needed to customize and adapt the configuration.
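As a minimal sketch of the "dedicated field" option, a spellcheck component in solrconfig.xml might be declared as follows. The field name `spell` and the field type `text_spell` are hypothetical and would have to exist in the schema; the numeric values are starting points to tune, not recommendations.

```xml
<!-- solrconfig.xml fragment: spellchecker reading terms straight from the index -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- Field type used to analyze the query text (hypothetical name) -->
  <str name="queryAnalyzerFieldType">text_spell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- Indexed field whose terms act as the dictionary (hypothetical name) -->
    <str name="field">spell</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <!-- Minimum similarity a suggestion must reach -->
    <float name="accuracy">0.5</float>
    <!-- Maximum edit distance considered, 1 or 2 -->
    <int name="maxEdits">2</int>
    <!-- Number of leading characters that must match the original term -->
    <int name="minPrefix">1</int>
  </lst>
</searchComponent>
```

The dictionary field is usually better left unstemmed, so that suggestions are real words rather than stems. With these settings "pomppe" and "chaler" are each within one edit of "pompe" and "chaleur", provided those terms occur in the indexed field.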
>
> - Does the spellcheck component only return the correct spelling, or is it
> also used to search the results?
>
I'm not sure, please check the documentation, but I remember that you can
configure it to directly re-execute the spell-corrected query AND show some
alternatives/suggestions to the user (obviously this is a display/frontend
choice).
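As a hedged illustration of the behaviour discussed above: as far as the SpellCheckComponent documentation describes it, the collation feature returns a corrected query string alongside the individual suggestions, and the client then decides whether to re-submit that string. The sketch below assumes a search component named `spellcheck` with a dictionary named `default` already exists in solrconfig.xml.

```xml
<!-- solrconfig.xml fragment: enable suggestions and collation by default -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">default</str>
    <!-- Number of suggestions returned per misspelled term -->
    <str name="spellcheck.count">5</str>
    <!-- Build a full corrected query from the best suggestions -->
    <str name="spellcheck.collate">true</str>
    <!-- Test candidate collations against the index before returning them -->
    <str name="spellcheck.maxCollationTries">5</str>
    <str name="spellcheck.maxCollations">1</str>
  </lst>
  <arr name="last-components">
    <!-- Name of the spellcheck search component (assumed) -->
    <str>spellcheck</str>
  </arr>
</requestHandler>
```

With `spellcheck.maxCollationTries` greater than zero, a returned collation has been checked to produce hits, so a front end can either re-run it silently or show it as a "did you mean" link.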
>
> - If I want to solve spelling, phonetic, and stemming problems for
> French, can I use a single field, or should I use several with different
> filters?
>
I don't think it's possible to use only one field. In my experience, it is
better to use multiple fields for multiple purposes. If you're worried about
index size, remember that fields that are indexed but NOT stored don't grow
your index that much. Mark as stored only the fields you need to display to
the end user.
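The multi-field layout described above might be wired up along these lines. All field names (`content`, `content_phonetic`, `spell`) and the types `text_fr_phonetic` and `text_spell` are hypothetical; only `text_fr` comes from the original post. The type definitions are omitted, so this is a fragment rather than a complete schema.

```xml
<!-- schema.xml fragment: one stored source field, two indexed-only copies -->
<schema name="french-sketch" version="1.5">
  <fields>
    <!-- Stemmed field, stored because it is shown to the end user -->
    <field name="content" type="text_fr" indexed="true" stored="true"/>
    <!-- Phonetic variant, searched but never displayed -->
    <field name="content_phonetic" type="text_fr_phonetic"
           indexed="true" stored="false"/>
    <!-- Lightly analyzed variant used as the spellcheck dictionary -->
    <field name="spell" type="text_spell" indexed="true" stored="false"/>
  </fields>

  <!-- Populate the derived fields from the source field at index time -->
  <copyField source="content" dest="content_phonetic"/>
  <copyField source="content" dest="spell"/>
</schema>
```

At query time the fields can then be weighted against each other, for example with the edismax parser and something like `qf=content^2 content_phonetic`, so that exact and stemmed matches rank above purely phonetic ones.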
Re: Solr french search optimisation
Posted by It-forum <it...@meseo.fr>.
Hello,
Thanks, Cristian, for the details.
I totally agree with your explanation: these are two different aspects
that I need to solve.
Could you clarify a few more things:
- Should SpellCheckComponent and phonetic filters be used while indexing or
only while querying?
- Does the spellcheck component only return the correct spelling, or is it
also used to search the results?
- If I want to solve spelling, phonetic, and stemming problems for French,
can I use a single field, or should I use several with
different filters?
Regards
David
Re: Solr french search optimisation
Posted by Cristian Cascetta <cr...@liquida.it>.
Hello,
I think you're confusing three different things:
1) schema and field definitions affect precision/recall: treating a
field differently changes the search results and their ranking
2) the "pomppe a chaler" problem is more a spellchecking problem
http://wiki.apache.org/solr/SpellCheckComponent
3) "solère" and "solaire" is a phonetic search problem
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PhoneticFilterFactory
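For the phonetic case in point 3, a separate field type using `solr.PhoneticFilterFactory` could be sketched as below. The type name `text_fr_phonetic` is hypothetical, and the choice of the `DoubleMetaphone` encoder is an assumption: it is designed mainly for English, so its behaviour on French words such as "solère" and "solaire" should be verified on the Analysis page of the admin UI before relying on it.

```xml
<!-- schema.xml fragment: phonetic matching, same analysis at index and query time -->
<fieldType name="text_fr_phonetic" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer>
    <!-- Strip accents first so that "solère" is encoded as "solere" -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- inject="false" replaces each token with its phonetic code
         instead of indexing the code next to the original token -->
    <filter class="solr.PhoneticFilterFactory"
            encoder="DoubleMetaphone" inject="false"/>
  </analyzer>
</fieldType>
```

Because the same analyzer runs at index and query time, two spellings match whenever the encoder gives them the same code. The stemmer is left out on purpose, since stemming and phonetic encoding address different problems and are easier to tune in separate fields.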
Hope this helps a little,
cristian
2013/5/23 It-forum <it...@meseo.fr>
> Hello again,
>
> Could anyone help me, please?
>
> David
Re: Solr french search optimisation
Posted by It-forum <it...@meseo.fr>.
Hello again,
Could anyone help me, please?
David