Posted to solr-user@lucene.apache.org by yandong yao <yy...@gmail.com> on 2010/08/09 15:57:24 UTC

how to support "implicit trailing wildcards"

Hi everyone,


How can I support an 'implicit trailing wildcard *' in Solr? For example, when
searching Google for 'umoun', 'umount' is matched; when searching for 'mounta',
'mountain' is matched.

From my point of view, there are several ways, each with disadvantages:

1) Using EdgeNGramFilterFactory, so that 'umount' is indexed as 'u', 'um',
'umo', 'umou', 'umoun', 'umount'. The disadvantages are: a) the index size
increases dramatically, b) it matches terms that are unrelated, e.g.
'mount' will also match 'mountain'.

2) Using two-pass searching: the first pass searches the term dictionary
through TermsComponent using the given keyword, and the second pass searches
again using the first matched term from the term dictionary. For example, when
the user enters 'umoun', TermsComponent will match 'umount', and 'umount' is
then used to search. The disadvantages are: a) the query string must be parsed
to recognize meta keywords such as 'AND', 'OR', '+', '-', '"' (which is more
complex because I am using a PHP client), b) the returned hit count is not for
the original search string, which affects other components such as an
auto-suggest component based on user search history and hit counts.

3) Writing a custom SearchComponent, although I have no idea where or how to start.
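
To make options 1) and 2) concrete, here is a minimal Python sketch that runs outside Solr. It only imitates the behaviour described above: edge_ngrams lists the prefixes that would be indexed for one term, and first_term_with_prefix stands in for the term-dictionary lookup of the first pass. Both function names and the sample term list are hypothetical, not Solr APIs.

```python
from bisect import bisect_left


def edge_ngrams(term, min_gram=1):
    """Return every leading prefix of term, shortest first."""
    return [term[:n] for n in range(min_gram, len(term) + 1)]


def first_term_with_prefix(sorted_terms, prefix):
    """Return the first term in a sorted list that starts with prefix, or None."""
    i = bisect_left(sorted_terms, prefix)
    if i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        return sorted_terms[i]
    return None


# Option 1: one six-letter term turns into six indexed tokens.
print(edge_ngrams("umount"))

# Option 2: the partial keyword is resolved to a full term before searching.
terms = sorted(["everest", "mount", "mountain", "umount"])
print(first_term_with_prefix(terms, "umoun"))
```

The number of prefixes grows with the length of every term, which is where the index growth in option 1) comes from.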

Is there any other way to do this in Solr? Any feedback or suggestions are
welcome!

Thanks very much in advance!

Re: how to support "implicit trailing wildcards"

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.

I guess q=mount OR (mount*)^0.01 would work equally well, i.e. by diminishing the effect of wildcard matches.
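
A minimal client-side sketch of building that query string in Python. The function name and the default boost are illustrative only, and the term is assumed to be a single plain word with no query-syntax characters.

```python
def exact_or_prefix(term, prefix_boost=0.01):
    """Build 'term OR (term*)^boost' so prefix matches contribute less to the score."""
    return f"{term} OR ({term}*)^{prefix_boost}"


print(exact_or_prefix("mount"))
```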
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 11 Aug 2010, at 17:53, yandong yao wrote:

> Hi Jan,
> 
> It seems q=mount OR mount* gives a different sort order than q=mount for those
> documents that include 'mount'.
> I changed it to q=mount^100 OR (mount?* -mount)^1.0, and it tested well.
> 
> Thanks very much!

Re: how to support "implicit trailing wildcards"

Posted by yandong yao <yy...@gmail.com>.

Hi Jan,

It seems q=mount OR mount* gives a different sort order than q=mount for those
documents that include 'mount'.
I changed it to q=mount^100 OR (mount?* -mount)^1.0, and it tested well.

Thanks very much!

2010/8/10 Jan Høydahl / Cominvent <ja...@cominvent.com>

> Hi,
>
> You don't need to duplicate the content into two fields to achieve this.
> Try this:
>
> q=mount OR mount*
>
> The exact match will always get a higher score than the wildcard match
> because wildcard matches use "constant score".
>
> Making this work for multi-term queries is a bit trickier, but something
> along these lines:
>
> q=(mount OR mount*) AND (everest OR everest*)
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Training in Europe - www.solrtraining.com

Re: how to support "implicit trailing wildcards"

Posted by Jan Høydahl / Cominvent <ja...@cominvent.com>.

Hi,

You don't need to duplicate the content into two fields to achieve this. Try this:

q=mount OR mount*

The exact match will always get a higher score than the wildcard match because wildcard matches use "constant score".

Making this work for multi-term queries is a bit trickier, but something along these lines:

q=(mount OR mount*) AND (everest OR everest*)
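
As a rough sketch, the multi-term form can be generated from the user's input like this in Python. The function name is hypothetical, and the input is assumed to be plain whitespace-separated words; operators such as AND, OR, '+', '-' and quotes are not handled.

```python
def implicit_trailing_wildcards(user_input):
    """Turn 'mount everest' into '(mount OR mount*) AND (everest OR everest*)'."""
    clauses = [f"({word} OR {word}*)" for word in user_input.split()]
    return " AND ".join(clauses)


print(implicit_trailing_wildcards("mount everest"))
```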

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 10 Aug 2010, at 09:38, Geert-Jan Brits wrote:

> You could satisfy this by making two fields:
> 1. exactmatch
> 2. wildcardmatch
> 
> Use copyField in your schema to copy 1 --> 2.
> 
> q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
> This would score exact matches above (solely) wildcard matches.
> 
> Geert-Jan

Re: how to support "implicit trailing wildcards"

Posted by Geert-Jan Brits <gb...@gmail.com>.

You could satisfy this by making two fields:
1. exactmatch
2. wildcardmatch

Use copyField in your schema to copy 1 --> 2.

q=exactmatch:mount+wildcardmatch:mount*&q.op=OR
This would score exact matches above (solely) wildcard matches.
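
For reference, a small Python sketch of how a client might URL-encode the two-field query above. The field names exactmatch and wildcardmatch are the ones from this message; the function name is hypothetical, and the term is assumed to be a single plain word.

```python
from urllib.parse import urlencode, parse_qs


def two_field_params(term):
    """Return the encoded query string for the exact-plus-wildcard field query."""
    return urlencode({
        "q": f"exactmatch:{term} wildcardmatch:{term}*",
        "q.op": "OR",  # the two clauses are combined with OR
    })


encoded = two_field_params("mount")
print(encoded)
print(parse_qs(encoded))  # decodes back to the original parameter values
```

The '+' between the two clauses in the raw URL is simply an encoded space.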

Geert-Jan

2010/8/10 yandong yao <yy...@gmail.com>

> Hi Bastian,
>
> Sorry for not making it clear: I also want exact matches to score higher than
> wildcard matches. That is, when searching for 'mount', documents with 'mount'
> should score higher than documents with 'mountain', whereas 'mount*' seems
> to treat 'mount' and 'mountain' the same.
>
> Besides, I also want the query to be processed by the analyzer, whereas
> according to
> http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
> Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer. The
> rationale is that if I search for 'mounted', I also want documents with
> 'mount' to match.
>
> So it seems the built-in wildcard search cannot satisfy my requirements, if I
> understand correctly.
>
> Thanks very much!

Re: how to support "implicit trailing wildcards"

Posted by yandong yao <yy...@gmail.com>.

Hi Bastian,

Sorry for not making it clear: I also want exact matches to score higher than
wildcard matches. That is, when searching for 'mount', documents with 'mount'
should score higher than documents with 'mountain', whereas 'mount*' seems
to treat 'mount' and 'mountain' the same.

Besides, I also want the query to be processed by the analyzer, whereas
according to
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer. The
rationale is that if I search for 'mounted', I also want documents with
'mount' to match.

So it seems the built-in wildcard search cannot satisfy my requirements, if I
understand correctly.
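
One partial client-side workaround for the analyzer limitation, sketched in Python under the assumption that the field's index-time analyzer only lowercases tokens: normalize the term yourself before appending the wildcard. The function name is hypothetical, and this does not cover stemming (e.g. 'mounted' to 'mount').

```python
def prefix_query(user_term):
    """Lowercase the term on the client, since wildcard terms skip the analyzer."""
    return user_term.strip().lower() + "*"


print(prefix_query("Mounta"))
```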

Thanks very much!


2010/8/9 Bastian Spitzer <bs...@magix.net>

> Wildcard-Search is already built in, just use:
>
> ?q=umoun*
> ?q=mounta*

Re: how to support "implicit trailing wildcards"

Posted by Bastian Spitzer <bs...@magix.net>.

Wildcard search is already built in; just use:

?q=umoun*
?q=mounta*
