
Posted to solr-user@lucene.apache.org by Robert Gründler <ro...@dubture.com> on 2010/11/11 16:51:18 UTC

EdgeNGram relevancy

Hi,

consider the following fieldtype (used for autocompletion):

  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>


This works fine as long as the query string is a single word. For multiple words, though, the ranking is unexpected.

Example:

Query String: "Bill Cl"

Result (in that order):

- Clyde Phillips
- Clay Rogers
- Roger Cloud
- Bill Clinton

"Bill Clinton" should have the highest rank in that case.  

Does anyone have an idea how to configure this fieldtype so that documents matching both tokens rank higher than those matching only one?


thanks!


-robert

Re: EdgeNGram relevancy

Posted by Nick Martin <ia...@googlemail.com>.

On 12 Nov 2010, at 01:46, Ahmet Arslan <io...@yahoo.com> wrote:

>> This setup now causes trouble with stopwords; here's
>> an example:
>> 
>> Let's say the index contains 2 Strings: "Mr Martin
>> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
>> list.
>> 
>> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
>> 
>> This way, the only result i get is "Mr Martin Scorsese",
>> because the strict field edgytext2 is boosted by 2.0. 
>> 
>> Any idea why in this case "Martin Scorsese" is not in the
>> result at all?
> 
> Did you run your query without the () and "" operators? If so, can you try this?
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
> 
> If not, can you paste the output of &debugQuery=on?
> 
> 
> 

This would still not deal with the problem of removing stop words from the indexing and query analysis stages.

I really need something that will allow that and give a single token as in the example below.

Best

Nick

Re: EdgeNGram relevancy

Posted by Robert Gründler <ro...@dubture.com>.

it seems adding the '+' (required) operator to each term in a multi-term query does the trick:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#+

ie: edgytext2:(+Martin +Sco)
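
For illustration only (this is not from the thread): a minimal Python sketch of how an application might build such a query from raw user input. The function name `require_all_terms` is hypothetical, and the sketch assumes the input contains no Lucene special characters, since it does no escaping.

```python
def require_all_terms(field, user_input):
    """Build a Lucene-syntax clause in which every whitespace-separated
    term of user_input is marked as required with '+'."""
    terms = user_input.split()
    if not terms:
        raise ValueError("user_input must contain at least one term")
    # Prefix each term with '+' and group them so the field applies to all.
    return "{}:({})".format(field, " ".join("+" + t for t in terms))


print(require_all_terms("edgytext2", "Martin Sco"))
```

Real input would need escaping of characters such as `:` or `(` before being placed in the query.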


-robert



On Nov 16, 2010, at 8:52 PM, Robert Gründler wrote:

> thanks for the explanation.
> 
> the results for the autocompletion are pretty good now, but we still have a small problem. 
> 
> When there are hits in the "edgytext2" fields, results which only have hits in the "edgytext" field
> should not be returned at all.
> 
> Example:
> 
> Query: "Martin Sco"
> 
> Current Results (in that order):
> 
> - "Martin Scorsese"
> - "Martin Lawrence"
> - "Joseph Martin"
> 
> However, in an autocompletion context, only "Martin Scorsese" makes sense, the 2 others are logically
> not correct.
> 
> I'm not sure if this can be solved on the solr side, or if we should implement the logic in the
> application.
> 
> 
> thanks!
> 
> -robert

Re: EdgeNGram relevancy

Posted by Robert Gründler <ro...@dubture.com>.

thanks for the explanation.

the results for the autocompletion are pretty good now, but we still have a small problem. 

When there are hits in the "edgytext2" fields, results which only have hits in the "edgytext" field
should not be returned at all.

Example:

Query: "Martin Sco"

Current Results (in that order):

- "Martin Scorsese"
- "Martin Lawrence"
- "Joseph Martin"

However, in an autocompletion context, only "Martin Scorsese" makes sense, the 2 others are logically
not correct.

I'm not sure if this can be solved on the solr side, or if we should implement the logic in the
application.

thanks!

-robert

On Nov 12, 2010, at 12:13 AM, Jonathan Rochkind wrote:

> Without the parens, the "edgytext:" prefix applied only to "Mr"; the default field still applied to "Scorsese".
> 
> The double quotes are necessary in the second case (rather than parens) because edgytext2 is a non-tokenized field: without quotes, the standard query parser will "pre-tokenize" on whitespace before sending the individual whitespace-separated words to match the index. If the index includes multi-word tokens with internal whitespace, they will never match. With double quotes, the standard query parser doesn't "pre-tokenize" like this; it passes the whole phrase to the index intact.

Re: EdgeNGram relevancy

Posted by Jonathan Rochkind <ro...@jhu.edu>.

Without the parens, the "edgytext:" prefix applied only to "Mr"; the default 
field still applied to "Scorsese".

The double quotes are necessary in the second case (rather than parens) 
because edgytext2 is a non-tokenized field: without quotes, the standard 
query parser will "pre-tokenize" on whitespace before sending the 
individual whitespace-separated words to match the index. If the index 
includes multi-word tokens with internal whitespace, they will never 
match. With double quotes, the standard query parser doesn't 
"pre-tokenize" like this; it passes the whole phrase to the index intact.
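
As a rough illustration of the scoping described above, here is a toy Python model (not the Lucene parser, and not from the thread). It only understands `field:term`, `field:(a b)`, `field:"a b"` and bare terms; operators such as OR and boosts are deliberately left out. The name `scope_terms` and the default field `text` are assumptions made for the example.

```python
import re

# Toy grammar: optional "field:" prefix, then a (group), a "phrase", or a bare term.
CLAUSE = re.compile(r'(?:(\w+):)?(?:\(([^)]*)\)|"([^"]*)"|(\S+))')


def scope_terms(query, default_field):
    """Return (field, text) pairs showing which field each piece is sent to."""
    clauses = []
    for field, group, phrase, term in CLAUSE.findall(query):
        field = field or default_field
        if group:
            # Parens: the field applies to every whitespace-separated term.
            clauses.extend((field, t) for t in group.split())
        elif phrase:
            # Quotes: the whole phrase reaches the field as a single unit.
            clauses.append((field, phrase))
        else:
            # No grouping: the field prefix covers only this one term.
            clauses.append((field, term))
    return clauses


print(scope_terms("edgytext:Mr Scorsese", "text"))
print(scope_terms("edgytext:(Mr Scorsese)", "text"))
print(scope_terms('edgytext2:"Mr Scorsese"', "text"))
```

In the first call "Scorsese" falls through to the default field, which is the situation in the unparenthesized query.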

Robert Gründler wrote:
>> Did you run your query without using () and "" operators? If yes can you try this?
>> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0
>>     
>
> I didn't use () and "" in my query before. Using the query with those operators
> works now, stopwords are thrown out as they should, thanks.
>
> However, i don't understand how the () and "" operators affect the StopWordFilter.
>
> Could you give a brief explanation for the above example?
>
> thanks!
>
>
> -robert
>
>
>
>
>
>

Re: EdgeNGram relevancy

Posted by Robert Gründler <ro...@dubture.com>.

> 
> Did you run your query without using () and "" operators? If yes can you try this?
> &q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

I didn't use () and "" in my query before. Using the query with those operators
works now, stopwords are thrown out as they should, thanks.

However, i don't understand how the () and "" operators affect the StopWordFilter.

Could you give a brief explanation for the above example?

thanks!


-robert

Re: EdgeNGram relevancy

Posted by Ahmet Arslan <io...@yahoo.com>.

> This setup now causes trouble with stopwords; here's
> an example:
> 
> Let's say the index contains 2 Strings: "Mr Martin
> Scorsese" and "Martin Scorsese". "Mr" is in the stopword
> list.
> 
> Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0
> 
> This way, the only result i get is "Mr Martin Scorsese",
> because the strict field edgytext2 is boosted by 2.0. 
> 
> Any idea why in this case "Martin Scorsese" is not in the
> result at all?

Did you run your query without the () and "" operators? If so, can you try this?
&q=edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0

If not, can you paste the output of &debugQuery=on?
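
A small sketch (not from the thread) of how the suggested query could be URL-encoded together with debugQuery=on, using only the Python standard library. The host, port and `/solr/select` path are placeholders for a local install and are assumptions.

```python
from urllib.parse import parse_qs, urlencode

params = {
    "q": 'edgytext:(Mr Scorsese) OR edgytext2:"Mr Scorsese"^2.0',
    "debugQuery": "on",  # asks Solr to include the parsed query in the response
}
query_string = urlencode(params)

# Placeholder base URL; adjust to the actual Solr instance.
url = "http://localhost:8983/solr/select?" + query_string
print(url)

# Round-trip check: decoding yields the original query text again.
decoded = parse_qs(query_string)
print(decoded["q"][0] == params["q"])
```

Encoding matters here because the query contains quotes, parentheses and `^`, which are easy to mangle when pasted into a URL by hand.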

Re: EdgeNGram relevancy

Posted by Robert Gründler <ro...@dubture.com>.

thanks a lot, that setup works pretty well now.

the only problem now is that the stopwords do not work as well anymore. I'll provide an example, but first the 2 fieldtypes:

  <!-- autocomplete field which finds matches inside strings ("scor" matches "Martin Scorsese") -->
  
  <fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>
  
  <!-- autocomplete field which finds "startsWith" matches only ("scor" matches only "Scorpio", but not "Martin Scorsese") -->  

  <fieldType name="edgytext2" class="solr.TextField" positionIncrementGap="100">
   <analyzer type="index">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
     <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.KeywordTokenizerFactory"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
   </analyzer>
  </fieldType>
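
To make the difference between the two index-time chains concrete, here is a rough Python simulation (an illustration, not Solr code and not from the thread). It mimics the filter order of the two fieldtypes above; the stopword set `{"mr"}` stands in for stopwords.txt, and all function names are assumptions.

```python
import re

STOPWORDS = {"mr"}  # assumption: stand-in for the contents of stopwords.txt


def edge_ngrams(token, min_size=1, max_size=25):
    """Front-anchored n-grams, like minGramSize=1 / maxGramSize=25."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]


def analyze(text, tokenizer, use_stopwords):
    """Simulate tokenizer -> lowercase -> (stop filter) -> pattern replace -> edge n-grams."""
    tokens = text.split() if tokenizer == "whitespace" else [text]
    tokens = [t.lower() for t in tokens]
    if use_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    # pattern="([^a-z])" replacement="" replace="all"
    tokens = [re.sub(r"[^a-z]", "", t) for t in tokens]
    return [gram for t in tokens for gram in edge_ngrams(t)]


def edgytext_index(text):
    return analyze(text, "whitespace", use_stopwords=True)


def edgytext2_index(text):
    # The keyword tokenizer keeps the whole string as one token.
    return analyze(text, "keyword", use_stopwords=False)


print("scor" in edgytext_index("Martin Scorsese"))
print("scor" in edgytext2_index("Martin Scorsese"))
print(edgytext2_index("Mr Martin Scorsese")[:3])
```

In this simulation edgytext indexes prefixes of every word, while edgytext2 indexes prefixes of the whole string with spaces stripped, so the stopword "Mr" stays part of the edgytext2 terms.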


This setup now causes trouble with stopwords; here's an example:

Let's say the index contains 2 Strings: "Mr Martin Scorsese" and "Martin Scorsese". "Mr" is in the stopword list.

Query: edgytext:Mr Scorsese OR edgytext2:Mr Scorsese^2.0

This way, the only result i get is "Mr Martin Scorsese", because the strict field edgytext2 is boosted by 2.0. 

Any idea why in this case "Martin Scorsese" is not in the result at all?


thanks again!


-robert






On Nov 11, 2010, at 5:57 PM, Ahmet Arslan wrote:

> You can add an additional field that uses KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both fields with an OR operator. 
> 
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> 
> You can even apply a boost so that begins-with matches come first.

Re: EdgeNGram relevancy

Posted by Andy <an...@yahoo.com>.

Ah I see. Thanks for the explanation.

Could you set the defaultOperator to "AND"? That way both "Bill" and "Cl" must be a match and that would exclude "Clyde Phillips".


--- On Thu, 11/11/10, Robert Gründler <ro...@dubture.com> wrote:

> From: Robert Gründler <ro...@dubture.com>
> Subject: Re: EdgeNGram relevancy
> To: solr-user@lucene.apache.org
> Date: Thursday, November 11, 2010, 3:51 PM
> according to the fieldtype i posted previously, i think it's because of:
> 
> 1. WhitespaceTokenizer splits the String "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips"
> 2. EdgeNGramFilter gets the 2 tokens, and creates EdgeNGrams for each token: "C" "Cl" "Cly" ... AND "P" "Ph" "Phi" ...
> 
> The Query String "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer.
> 
> This creates a match between the 2nd token of the query, "Cl", and one of the "sub"tokens the EdgeNGramFilter created: "Cl".
> 
> 
> -robert
> 
> On Nov 11, 2010, at 21:34 , Andy wrote:
> 
> > Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?
> > 
> > "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?
> > 
> > Thanks.

Re: EdgeNGram relevancy

Posted by Robert Gründler <ro...@dubture.com>.

according to the fieldtype i posted previously, i think it's because of:

1. WhitespaceTokenizer splits the String "Clyde Phillips" into 2 tokens: "Clyde" and "Phillips"
2. EdgeNGramFilter gets the 2 tokens, and creates EdgeNGrams for each token: "C" "Cl" "Cly" ...   AND  "P" "Ph" "Phi" ...

The Query String "Bill Cl" gets split up into 2 tokens, "Bill" and "Cl", by the WhitespaceTokenizer.

This creates a match between the 2nd token of the query, "Cl", and one of the "sub"tokens the EdgeNGramFilter created: "Cl".
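
The two steps above can be sketched in Python (an illustration only, not from the thread and not Solr code). The sketch lowercases and splits on whitespace, builds edge n-grams per word, and compares "any term matches" with "all terms must match". Stopwords and the pattern-replace filter are left out, and the function names are assumptions.

```python
def edge_ngrams(token, max_size=25):
    """All front-anchored prefixes of token, from length 1 up to max_size."""
    return {token[:n] for n in range(1, min(len(token), max_size) + 1)}


def index_terms(name):
    """Terms the index would hold for one name: prefixes of each word."""
    terms = set()
    for word in name.lower().split():
        terms |= edge_ngrams(word)
    return terms


def matches(query, name, require_all):
    """True if any (or, with require_all, every) query term is an indexed term."""
    indexed = index_terms(name)
    hits = [term in indexed for term in query.lower().split()]
    return all(hits) if require_all else any(hits)


names = ["Clyde Phillips", "Clay Rogers", "Roger Cloud", "Bill Clinton"]
or_hits = [n for n in names if matches("Bill Cl", n, require_all=False)]
and_hits = [n for n in names if matches("Bill Cl", n, require_all=True)]
print(or_hits)
print(and_hits)
```

With OR semantics all four names match because each contains a word starting with "cl"; requiring both terms leaves only "Bill Clinton". The sketch says nothing about scoring, only about which documents match.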


-robert




On Nov 11, 2010, at 21:34 , Andy wrote:

> Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?
> 
> "Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?
> 
> Thanks.

Re: EdgeNGram relevancy

Posted by Andy <an...@yahoo.com>.

Could anyone help me understand why "Clyde Phillips" appears in the results for "Bill Cl"?

"Clyde Phillips" doesn't produce any EdgeNGram that would match "Bill Cl", so why is it even in the results?

Thanks.

--- On Thu, 11/11/10, Ahmet Arslan <io...@yahoo.com> wrote:

> You can add an additional field that uses KeywordTokenizerFactory
> instead of WhitespaceTokenizerFactory, and query both fields with
> an OR operator. 
> 
> edgytext:(Bill Cl) OR edgytext2:"Bill Cl"
> 
> You can even apply a boost so that begins-with matches come
> first.

Re: EdgeNGram relevancy

Posted by Ahmet Arslan <io...@yahoo.com>.

You can add an additional field that uses KeywordTokenizerFactory instead of WhitespaceTokenizerFactory, and query both fields with an OR operator. 

edgytext:(Bill Cl) OR edgytext2:"Bill Cl"

You can even apply a boost so that begins-with matches come first.
