Posted to solr-user@lucene.apache.org by Shashi Kant <sk...@sloan.mit.edu> on 2011/12/29 19:06:32 UTC

Re: Solr, SQL Server's LIKE

For a simple, hackish (albeit inefficient) approach, look up wildcard searches,

e.g. foo*, *bar



On Thu, Dec 29, 2011 at 12:38 PM, Devon Baumgarten
<db...@nationalcorp.com> wrote:
> I have been tinkering with Solr for a few weeks, and I am convinced that it could be very helpful in many of my upcoming projects. I am trying to decide whether Solr is appropriate for this one, and I haven't had luck looking for answers on Google.
>
> I need to search a list of names of companies and individuals pretty exactly. T-SQL's LIKE operator does this with decent performance, but I have a feeling there is a way to configure Solr to do this better. I've tried using an edge N-gram tokenizer, but it feels like it might be more complicated than necessary. What would you suggest?
>
> I know this sounds kind of 'Golden Hammer,' but there has been talk of other, more complicated (magic) searches that I don't think SQL Server can handle, since its tokens (as far as I know) can't be smaller than one word.
>
> Thanks,
>
> Devon Baumgarten
>

Re: Solr, SQL Server's LIKE

Posted by Chantal Ackermann <ch...@btelligent.de>.

Thanks, Erick! That sounds great. I really do have to upgrade.

Chantal



Re: Solr, SQL Server's LIKE

Posted by Erick Erickson <er...@gmail.com>.

Chantal:

bq: The problem with the wildcard searches is that the input is not
analyzed.

As of 3.6/4.0, this is no longer entirely true. Some analysis is
performed for wildcard searches by default, and you can
specify almost anything you want if you really need to. See:
https://issues.apache.org/jira/browse/SOLR-2438
and
http://wiki.apache.org/solr/MultitermQueryAnalysis
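
To illustrate why analysis of wildcard terms matters, here is a minimal
Python sketch (a toy model, not Solr itself). It assumes a field whose
terms were lowercased at index time; the names INDEXED_TERMS and
prefix_search are hypothetical. A prefix that skips analysis misses terms
that the same prefix finds once it has been lowercased.

```python
# Toy model of a field whose analyzer lowercases terms at index time.
# This is an illustration only; it does not call Solr or Lucene.
INDEXED_TERMS = sorted(term.lower() for term in ["Musil", "Hinden", "Van"])


def prefix_search(prefix, analyze_prefix):
    """Return the indexed terms that start with `prefix`.

    analyze_prefix=False mimics a wildcard term that bypasses analysis.
    analyze_prefix=True mimics lowercasing the wildcard term first.
    """
    if analyze_prefix:
        prefix = prefix.lower()
    return [term for term in INDEXED_TERMS if term.startswith(prefix)]


# The index only holds "musil", so the unanalyzed prefix "Mu" finds nothing.
print(prefix_search("Mu", analyze_prefix=False))
print(prefix_search("Mu", analyze_prefix=True))
```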

Best
Erick


RE: Solr, SQL Server's LIKE

Posted by Devon Baumgarten <db...@nationalcorp.com>.

Hoss,

Thanks. You've answered my question. To clarify, what I should have asked for instead of 'exact' was 'not fuzzy'. For some reason it didn't occur to me that I didn't need n-grams to use the wildcard. You asking for me to clarify what I meant made me realize that the n-grams are the source of all my current problems. :)

Thanks!

Devon Baumgarten



RE: Solr, SQL Server's LIKE

Posted by Chantal Ackermann <ch...@btelligent.de>.

The problem with the wildcard searches is that the input is not
analyzed. For English, this might not be such a problem (unless you
expect case-insensitive search). But then again, you don't get that with
like, either. Ngrams bring that and more.

What I think is often forgotten when comparing 'like' and Solr search
is:
Solr's analyzers allow not only for case-insensitive search but also for
other analysis such as removing diacritics, and this is also applied when
sorting (you have to create a separate index in the DB, as well, if you
want that).

Say you have the following names:
'Van Hinden'
'van Hinden'
'Música'
'Musil'

like 'mu%' - no hits
like 'Mu%' - 1 hit
like 'van%' - 1 hit
like 'hin%' - no hits

with Solr's whitespace or standard tokenizer, ngrams, and diacritics and
lowercase filters (no wildcard search):
'mu'/'Mu' - 2 hits sorted ignoring case and diacritics
'van' - 2 hits
'hin' - 2 hits


(This is written down from experience. I haven't checked those examples
explicitly.)
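
A rough way to check these counts without a Solr instance is the Python
sketch below. It is a simulation under stated assumptions, not Solr's
actual analysis chain: LIKE is modeled as a case- and accent-sensitive
prefix match (which depends on the database collation), tokens are split
on whitespace, and edge ngrams stand in for the ngram filter. All function
names are hypothetical.

```python
import unicodedata

NAMES = [
    "Van Hinden",
    "van Hinden",
    "M\u00fasica",  # "Musica" with a precomposed u-acute
    "Musil",
]


def like_prefix(names, prefix):
    """Model LIKE 'prefix%' under a case- and accent-sensitive collation."""
    return [name for name in names if name.startswith(prefix)]


def fold(text):
    """Lowercase and strip diacritics, roughly like lowercase + folding filters."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def edge_ngrams(token, min_gram=1, max_gram=15):
    """All leading substrings of token between min_gram and max_gram long."""
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}


def build_index(names):
    """Map each folded edge ngram of each whitespace token to matching names."""
    index = {}
    for name in names:
        for token in fold(name).split():
            for gram in edge_ngrams(token):
                index.setdefault(gram, set()).add(name)
    return index


def search(index, query):
    """Fold the query the same way and look it up as a single term."""
    hits = index.get(fold(query), set())
    # Sort ignoring case and diacritics; the raw name only breaks ties.
    return sorted(hits, key=lambda name: (fold(name), name))


index = build_index(NAMES)
for query in ["mu", "Mu", "van", "hin"]:
    print(query, like_prefix(NAMES, query), search(index, query))
```

Under these assumptions the simulated LIKE returns 0, 1, 1, and 0 hits for
'mu', 'Mu', 'van', and 'hin', while the analyzed ngram lookup returns 2
hits for each.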

Cheers,
Chantal




RE: Solr, SQL Server's LIKE

Posted by Chris Hostetter <ho...@fucit.org>.

: Thanks. I know I'll be able to utilize some of Solr's free text 
: searching capabilities in other search types in this project. The 
: product manager wants this particular search to exactly mimic LIKE%.
	...
: Ex: If I search "Albatross" I want "Albert" to be excluded completely, 
: rather than having a low score.

Please be specific about the types of queries you want, i.e., we need more
than one example of the type of input you want to provide, the type of
matches you want to see for that input, and the type of matches you want
to get back.

In your first message you said you need to match company titles "pretty
exactly" but then seem to contradict yourself by saying that SQL's LIKE
command fits the bill -- even though the SQL LIKE command exists
specifically for inexact matches on field values.

Based on your one example above of Albatross, you don't need anything
special: don't use ngrams, don't use stemming, don't use fuzzy anything --
just search for "Albatross" and it will match "Albatross" but not
"Albert".  If you want "Albatross" to match "Albatross Road", use some
basic tokenization.

If all you really care about is prefix searching (which seems suggested by
your "LIKE%" comment above, which I'm guessing is shorthand for something
similar to "LIKE 'ABC%'"), so that queries like "abc" and "abcd" both
match "abcdef" and "abcdzzzz" but neither of them matches "xxxxabcdyyyy",
then just use prefix queries (i.e., "abcd*") -- they should be plenty
efficient for your purposes.  You only need to worry about ngrams when you
want to efficiently match in the middle of a string (i.e., "TITLE LIKE
'%ABC%'").
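
The difference between the two cases can be sketched in Python. This is a
toy model that assumes the indexed terms are kept in sorted order (loosely
modeled on a sorted term dictionary), not a description of Lucene's actual
implementation; prefix_matches and infix_matches are hypothetical names. A
prefix lookup can jump to the first candidate and stop at the first
non-match, while an infix lookup has to inspect every term.

```python
import bisect

# Terms from the example above, kept sorted like a term dictionary.
TERMS = sorted(["abcdef", "abcdzzzz", "xxxxabcdyyyy"])


def prefix_matches(sorted_terms, prefix):
    """Binary-search to the first candidate, then walk while the prefix holds."""
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order guarantees no later term can match
        matches.append(term)
    return matches


def infix_matches(terms, fragment):
    """Without ngrams, a mid-string match must scan every term."""
    return [term for term in terms if fragment in term]


print(prefix_matches(TERMS, "abc"))
print(prefix_matches(TERMS, "abcd"))
print(infix_matches(TERMS, "abcd"))
```

Here both prefixes return only "abcdef" and "abcdzzzz", while the infix
scan also returns "xxxxabcdyyyy".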


-Hoss

RE: Solr, SQL Server's LIKE

Posted by Devon Baumgarten <db...@nationalcorp.com>.

Great suggestion! Thanks for keeping it simple for a complete Solr newbie.

I'm going to go try this right now.

Thanks!
Devon Baumgarten


Re: Solr, SQL Server's LIKE

Posted by Shawn Heisey <so...@elyograg.org>.

On 12/29/2011 3:51 PM, Devon Baumgarten wrote:
> N-Grams get me pretty great results in general, but I don't want the results for this particular search to be fuzzy. How can I prevent the fuzzy matches from appearing?
>
> Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather than having a low score.

To achieve this while using the ngram filter, just do the ngram analysis 
on the index side, but not on the query side.  If you do this, you'll 
likely need a maxGramSize larger than would normally be required (which 
will make the index larger), and you might need to use the LengthFilter too.
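
The effect of this can be sketched in Python. This is a simulation under
stated assumptions, not Solr's ngram filter: tokens are lowercased, a
document matches when any query term equals one of its indexed grams, and
the function names and gram sizes are chosen only for illustration.

```python
DOCS = ["Albatross", "Albert"]


def ngrams(token, min_gram, max_gram):
    """All substrings of token with length between min_gram and max_gram."""
    return {
        token[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(token) - n + 1)
    }


def search_ngrams_both_sides(docs, query, min_gram, max_gram):
    """Ngram the query too: any shared gram counts as a (fuzzy) match."""
    query_grams = ngrams(query.lower(), min_gram, max_gram)
    return [
        doc for doc in docs
        if query_grams & ngrams(doc.lower(), min_gram, max_gram)
    ]


def search_ngrams_index_only(docs, query, min_gram, max_gram):
    """Ngram only the documents: the whole query must equal an indexed gram."""
    term = query.lower()
    return [
        doc for doc in docs
        if term in ngrams(doc.lower(), min_gram, max_gram)
    ]


# "Albert" shares grams such as "al" and "alb" with the query, so it matches.
print(search_ngrams_both_sides(DOCS, "Albatross", 2, 3))
# With a small maximum gram size the 9-letter query matches nothing at all.
print(search_ngrams_index_only(DOCS, "Albatross", 2, 3))
# A larger maximum gram size lets the full query match only "Albatross".
print(search_ngrams_index_only(DOCS, "Albatross", 2, 10))
```

The second call shows why a larger maxGramSize is needed: a query longer
than the largest indexed gram can never match when only the index side is
ngrammed.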

Thanks,
Shawn

RE: Solr, SQL Server's LIKE

Posted by Devon Baumgarten <db...@nationalcorp.com>.

Erick,

Thanks. I know I'll be able to utilize some of Solr's free text searching capabilities in other search types in this project. The product manager wants this particular search to exactly mimic LIKE%.

N-Grams get me pretty great results in general, but I don't want the results for this particular search to be fuzzy. How can I prevent the fuzzy matches from appearing?

Ex: If I search "Albatross" I want "Albert" to be excluded completely, rather than having a low score.

Devon Baumgarten


Re: Solr, SQL Server's LIKE

Posted by Erick Erickson <er...@gmail.com>.

SQL's "like" is usually handled with ngrams if you want
*stuff* kinds of searches. Wildcards are "interesting"
in Solr.

Things Solr handles that aren't easy in SQL:
phrases, phrases with slop, stemming,
synonyms. And, especially, some kind
of relevance ranking.

But Solr does NOT do the things SQL is best at,
things like joins etc. Each has its sweet spot,
and trying to make one do all the functions of the
other is fraught with places to go wrong.

Not a lot of help, but free text searching is what Solr is
all about, so if your problem maps into that space,
it's a great tool!

Best
Erick
