Posted to solr-user@lucene.apache.org by Carsten L <cl...@fynskemedier.dk> on 2008/11/18 08:35:31 UTC

Use SOLR like the "MySQL LIKE"

Hello.

The data:
I have a dataset containing ~500,000 documents.
Each document contains an email, a name and a user ID.

The problem:
I would like to be able to search it, and the search should behave like
MySQL's LIKE operator.

So when a user enters the search term: "carsten", then the query looks like:
        "name:(carsten) OR name:(carsten*) OR email:(carsten) OR
email:(carsten*) OR userid:(carsten) OR userid:(carsten*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen
Carsten
CARSTEN
etc.

And when the user enters the term: "carsten l" the query looks like:
        "name:(carsten l) OR name:(carsten l*) OR email:(carsten l) OR
email:(carsten l*) OR userid:(carsten l) OR userid:(carsten l*)"

Then it should match:
carsten l
carsten larsen
Carsten Larsen

Or, written in MySQL syntax: "... WHERE `name` LIKE 'carsten%' OR
`email` LIKE 'carsten%' OR `userid` LIKE 'carsten%'..."

I know that I need to use the "solr.LowerCaseTokenizerFactory" on my name
and email fields to ensure case-insensitive behavior.
The problem seems to be the combination of wildcards and whitespace.
-- 
View this message in context: http://www.nabble.com/Use-SOLR-like-the-%22MySQL-LIKE%22-tp20554732p20554732.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: Use SOLR like the "MySQL LIKE"

Posted by Norberto Meijome <nu...@gmail.com>.

On Tue, 18 Nov 2008 14:26:02 +0100
"Aleksander M. Stensby" <al...@integrasco.no> wrote:

> Well, then I suggest you index the field in two different ways if you want  
> both possible ways of searching. One, where you treat the entire name as  
> one token (in lowercase) (then you can search for avera* and match on for  
> instance "average joe" etc.) And then another field where you tokenize on  
> whitespace for instance, if you want/need that possibility as well. Look at  
> the solr copy fields and try it out, it works like a charm :)

You should also make extensive use of analysis.jsp to see how the data in your
field is tokenized, filtered and indexed, and how your search terms are
tokenized, filtered and matched against it.
Hint 1 : check all the checkboxes ;)
Hint 2: you don't need to reindex all your data, just enter test data in the
form and give it a go. You will of course have to tweak schema.xml and restart
your service when you do this.
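
A related point when testing prefix searches against a field that holds the
whole name as one lowercased token: as far as I know, the Lucene query parser
splits the input on whitespace unless it is escaped, and wildcard terms are not
run through the field's analyzer, so the client has to lowercase and escape the
input itself. A minimal Python sketch under those assumptions follows; the
field name name_exact and both helper functions are hypothetical and not part
of Solr.

```python
import re

# Characters with special meaning in the Lucene query syntax, plus whitespace.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')


def escape_term(text):
    """Backslash-escape query syntax characters and whitespace."""
    return _SPECIAL.sub(r"\\\1", text)


def prefix_query(user_input, fields):
    """Build 'field:term*' clauses joined with OR.

    The input is lowercased here because the prefix term is assumed
    to bypass the analyzer on the Solr side.
    """
    term = escape_term(user_input.strip().lower())
    return " OR ".join(f"{field}:{term}*" for field in fields)


# 'name_exact' is a hypothetical single-token, lowercased copy of 'name'.
print(prefix_query("Carsten L", ["name_exact"]))
# name_exact:carsten\ l*
```

The escaped space keeps "carsten l" together as one prefix term instead of
letting the parser treat "carsten" and "l*" as two separate clauses.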

good luck,
B
_________________________
{Beto|Norberto|Numard} Meijome

Re: Use SOLR like the "MySQL LIKE"

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.

Ah, okay!
Well, then I suggest you index the field in two different ways if you want  
both possible ways of searching. One, where you treat the entire name as  
one token (in lowercase) (then you can search for avera* and match on for  
instance "average joe" etc.) And then another field where you tokenize on  
whitespace for instance, if you want/need that possibility as well. Look at  
the solr copy fields and try it out, it works like a charm :)
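
To illustrate the difference between the two fields, here is a rough Python
simulation. It is not Solr code: the two functions only imitate a single-token
lowercased analyzer and a whitespace-tokenizing one, and "Lars Carstensen" is
an invented extra name added to show where the results diverge.

```python
def keyword_tokens(name):
    """Imitate a field that keeps the whole value as one lowercased token."""
    return [name.lower()]


def whitespace_tokens(name):
    """Imitate a field that lowercases and splits the value on whitespace."""
    return name.lower().split()


def prefix_hits(prefix, names, analyze):
    """Return the names for which at least one indexed token starts with prefix."""
    prefix = prefix.lower()
    return [n for n in names if any(t.startswith(prefix) for t in analyze(n))]


names = ["carsten l", "carsten larsen", "Carsten Larsen",
         "Carsten", "CARSTEN", "Lars Carstensen"]

# Whole-name field: behaves like LIKE 'carsten l%'.
print(prefix_hits("carsten l", names, keyword_tokens))
# ['carsten l', 'carsten larsen', 'Carsten Larsen']

# Tokenized field: no single token contains a space, so nothing matches.
print(prefix_hits("carsten l", names, whitespace_tokens))
# []

# Tokenized field: also matches a word in the middle of the name.
print(prefix_hits("carsten", names, whitespace_tokens))
```

The last call returns all six names, including "Lars Carstensen", which the
whole-name field would not return for the same prefix.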

Cheers,
  Aleksander

On Tue, 18 Nov 2008 10:40:24 +0100, Carsten L <cl...@fynskemedier.dk> wrote:

>
> Thanks for the quick reply!
>
> It is supposed to work a little like the Google Suggest or field
> autocompletion.
>
> I know I mentioned email and userid, but the problem lies with the name
> field, because of the whitespaces in combination with the wildcard.
>
> I looked at the solr.WordDelimiterFilterFactory, but it does not mention
> anything about whitespaces - or wildcards.
>
> A quick brushup:
> I would like to mimic the LIKE functionality from MySQL, using a wildcard
> at the end of the search query.
> In MySQL, whitespace is treated as ordinary characters, not as a "splitter".



-- 
Aleksander M. Stensby
Senior software developer
Integrasco A/S
www.integrasco.no

Re: Use SOLR like the "MySQL LIKE"

Posted by Carsten L <cl...@fynskemedier.dk>.

Thanks for the quick reply!

It is supposed to work a little like the Google Suggest or field
autocompletion.

I know I mentioned email and userid, but the problem lies with the name
field, because of the whitespaces in combination with the wildcard.

I looked at the solr.WordDelimiterFilterFactory, but it does not mention
anything about whitespaces - or wildcards.

A quick brushup:
I would like to mimic the LIKE functionality from MySQL, using a wildcard
at the end of the search query.
In MySQL, whitespace is treated as ordinary characters, not as a "splitter".




Re: Use SOLR like the "MySQL LIKE"

Posted by "Aleksander M. Stensby" <al...@integrasco.no>.

Hi there,

You should use LowerCaseTokenizerFactory as you point out yourself. As far  
as I know, the StandardTokenizer "recognizes email addresses and internet  
hostnames as one token". In your case, I guess you want an email, say  
"average.joe@apache.org" to be split into four tokens: average joe apache  
org, or something like that, which would indeed allow you to search for  
"joe" or "average j*" and match. To do so, you could use the  
WordDelimiterFilterFactory and split on intra-word delimiters (I think the  
defaults here are non-alphanumeric chars).
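
As a rough illustration of the splitting described above, the following Python
sketch lowercases a value and splits it on runs of non-alphanumeric characters.
It only approximates the idea for ASCII input; it is not the actual behavior of
WordDelimiterFilterFactory, and both function names are made up for this
example.

```python
import re


def split_on_delimiters(text):
    """Lowercase the text and split it on runs of non-alphanumeric ASCII characters."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]


def matches(term, tokens):
    """True if a token equals term, or starts with it when term ends in '*'."""
    if term.endswith("*"):
        return any(tok.startswith(term[:-1]) for tok in tokens)
    return term in tokens


tokens = split_on_delimiters("average.joe@apache.org")
print(tokens)                  # ['average', 'joe', 'apache', 'org']
print(matches("joe", tokens))  # True
print(matches("j*", tokens))   # True
```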

Take a look at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters  
for more info on tokenizers and filters.

cheers,
  Aleks

