Posted to solr-user@lucene.apache.org by elisabeth benoit <el...@gmail.com> on 2011/12/08 10:26:03 UTC

Solr 3.4 problem with words separated by coma without space

Hello,

I'm using Solr 3.4, and I'm having a problem with a request that returns
different results depending on whether there is a space after a comma.

The request "name, number rue taine paris" returns results with 4 words out
of 5 matching ("name", "number", "rue", "paris").

The request "name,number rue taine paris" (no space between the comma and
"number") returns no results unless I set mm=3, and then the matching words
are "rue", "taine", "paris".

If I check in the Solr admin analyzer, I get the same analysis for the two
requests. But it seems, in fact, that the missing space after the
comma prevents "name" and "number" from matching.


My field type is


      <analyzer type="query">
        <!-- standard tokenization -->
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- normalize accents, cedillas, the œ ligature, ... -->
        <charFilter class="solr.MappingCharFilterFactory"
                    mapping="mapping-ISOLatin1Accent.txt"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <!-- remove dots (I.B.M. => IBM) -->
        <filter class="solr.StandardFilterFactory"/>
        <!-- lowercase -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- strip leading and trailing punctuation -->
        <filter class="solr.PatternReplaceFilterFactory"
                pattern="^(\p{Punct}*)(.*?)(\p{Punct}*)$" replacement="$2"/>
        <!-- remove empty tokens and overly long words -->
        <filter class="solr.LengthFilterFactory" min="1" max="100" />
        <!-- split compound words -->
        <filter class="solr.WordDelimiterFilterFactory"
                splitOnCaseChange="1" splitOnNumerics="1" stemEnglishPossessive="1"
                generateWordParts="1" generateNumberParts="1"
                catenateWords="0" catenateNumbers="1"
                catenateAll="0" preserveOriginal="1"/>
        <!-- remove elisions (l', qu', ...) -->
        <filter class="solr.ElisionFilterFactory"
                articles="elisionwords.txt"/>
        <!-- remove stop words -->
        <filter class="solr.StopFilterFactory" ignoreCase="1"
                words="stopwords.txt" enablePositionIncrements="true"/>
        <!-- stemming (plurals, ...) -->
        <filter class="solr.SnowballPorterFilterFactory" language="French"
                protected="protwords.txt"/>
        <!-- remove any duplicates -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>

Does anyone have a clue?

Thanks,
Elisabeth

Re: Solr 3.4 problem with words separated by coma without space

Posted by elisabeth benoit <el...@gmail.com>.

Thanks for the answer.

Yes, in fact, when I look at the debugQuery output, I notice that name and
number are never treated as single entries.

I have

(((text:name text:number)) (text:ru) (text:tain) (text:paris)))

so name and number are in the same parentheses, but not exactly treated as a
phrase, as far as I know, since a phrase would be more like text:"name
number".

Could you tell me what the difference is between (text:name text:number)
and (text:"name number")?

I'll check autoGeneratePhraseQueries.

Best regards,
Elisabeth





Re: Solr 3.4 problem with words separated by coma without space

Posted by Chris Hostetter <ho...@fucit.org>.

: If I check in the solr.admin.analyzer, I get the same analysis for the two
: different requests. But it seems, in fact, that the lacking space after
: comma prevents name and number from matching.

Query analysis is only part of the picture ... Did you look at the
debugQuery output? ... I believe you are seeing the effects of the
QueryParser analyzing "name," distinctly from "number" in one case, vs.
analyzing the entire string "name,number" in the second case, and treating
the latter as a phrase query (because one input clause produces multiple
tokens).

There is a recently added autoGeneratePhraseQueries option that affects
this.
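
For reference, autoGeneratePhraseQueries is set as an attribute on the field
type in schema.xml. The fragment below is only a sketch: the field type name
"text_fr" and the positionIncrementGap value are placeholders, not taken from
this thread, and the analyzer chain is cut down to its tokenizer. Whether
"false" changes the behaviour described here is not confirmed in this thread,
so compare the debugQuery output before and after the change.

```xml
<!-- "text_fr" is a hypothetical name; use the name of your own field type -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100"
           autoGeneratePhraseQueries="false">
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- the remaining filters of the query analyzer go here, unchanged -->
  </analyzer>
</fieldType>
```

As far as I understand it, with "true" a whitespace-free chunk such as
"name,number" that analyzes to several tokens is turned into a phrase query
like text:"name number"; with "false" the tokens become separate term clauses
like (text:name text:number).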


-Hoss

Re: Solr 3.4 problem with words separated by coma without space

Posted by da...@ontrenet.com.

This would seem to indicate that you are using a whitespace analyzer on
the default search field. I believe other analyzers will properly tokenize
around the comma.

> same problem with Solr 4.0

Re: Solr 3.4 problem with words separated by coma without space

Posted by elisabeth benoit <el...@gmail.com>.

same problem with Solr 4.0
