You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Alice <al...@gmail.com> on 2006/11/07 17:51:16 UTC

Help on search

Hello!

I am totally new to Lucene and I'm trying to use it with my web application.

 

What I'm doing is reading a table from my database line by line and indexing
the columns.

I read the users data such as First Name, Last Name, Email and so on.

 

I hava a field by column, such as: firstName="Frederich" lastName="Brown"

My intetion is to make users find other users using name, lastName or email.

What is the best way to do it?

If the user enter "Fred Brown", as I use queryParser, this string is broken
to "Fred" "Brown" and I search those tokens in every field.

 

As "Brown" was indexed, the user's registry is found.

But if the user enters "Fred", no user is found.

Why is that? I thought it would return the user "Frederich Brown" either.

Can someone help?

 

Thanks!

RE: Help on search

Posted by Alice <al...@gmail.com>.

Thank you all guys!

I'll work on that!

-----Original Message-----
From: Vladimir Olenin [mailto:VOlenin@cihi.ca] 
Sent: terça-feira, 7 de novembro de 2006 16:58
To: java-user@lucene.apache.org
Subject: RE: Help on search

You might actually try to look for some 'names database' (similar to
Wordnet). Someone has probably already compiled a list of english 'names'
and their common short forms (eg, 'Vlad' for 'Vladimir', 'Fred' for
'Frederich', etc). Alternatevly, compile such DB yourself (and don't forget
to publish it under OSS license afterwards!!!! :) ) - if there is none right
now it really can be useful for quite a few applications.

Once you have the DB your can do similar things as with 'english synomym' DB
(not sure what Lucene notations are for this) - lookup 'synonym' for the
name and thus be able to search all forms.... Or use such DB for custom
stemming, as Erick suggested.

Vlad

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, November 07, 2006 1:27 PM
To: java-user@lucene.apache.org
Subject: Re: Help on search

<<<How can I make a rule for stemming names?>>>

That's the problem all right. And I don't have any good answer. Wildcarding
could work for you, but it's really the same question, isn't it? "How do you
transform a typed name into a wildcard query?" is, I think, equivalent to
your question above (assuming that wildcarding is really truncation. i.e.
Fred* and not Fred*i*). And if you can answer it, you can stem or
wildcard......

Could a really simple rule like "only use the first 4 letters" work? Of
course there are obvious names for which this wouldn't work. But is
something like that "good enough"?

There is one other possibility, soundex/metaphone. There's a discussion in
Lucene in Action about how to do this. The basic idea is that you index and
search on words that "sound like" what's in the document. Whether "sounds
like" makes Fred and Frederich equivalent is the unknown here.....

Not much help I know

Good luck
Erick

On 11/7/06, Alice <al...@gmail.com> wrote:
>
> Thank you for answering.
>
> I thought about stemming at index and search time, but Im dealing with 
> names.
> How can I make a rule for stemming names?
>
> I'm thinking about different names, I thought I could use Lucene to 
> help users find whomever they're looking for without having to spell 
> their friends names correctly or fully.
>
> That's why I gave the example about 'Frederich', it is very common 
> that users enter 'Fred'.
>
> I'm going to read about the wildcards.
>
> Anymore suggestions are very welcome.
>
> Thanks
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: terça-feira, 7 de novembro de 2006 15:12
> To: java-user@lucene.apache.org
> Subject: Re: Help on search
>
> You really have to get away from thinking about lucene like you would 
> a database, or you'll have endless problems <G>.
>
> If you want just a bag of name parts, why use separate fields? For 
> instance, index both firstname and lastname in a name field in each 
> document. Then all you have to do to find all documents that have 
> Brown in them is search on the single field name. There's no need to 
> construct complex or clauses, which is what I suspect you're doing 
> now.....
>
> You can also index firstname, lastname, and a third field "bagofnames"
> that
> contains both if you want.
>
> Vladimir is right, unless you take some actions you always get exact 
> matches when you search, so I'm not surprised that searching for Fred 
> does NOT return Frederich-containing documents.
>
> Whether wildcarding is your best strategy is debatable, it depends 
> upon your intent. You could consider stemming both at index and search 
> time.
> Generically, all that means is that you index "Fred" for "Frederich". 
> and at search time you search for "Fred". There are stemming analyzers 
> in Lucene, but whether they do what you want only you can decide. Be 
> aware that wildcard searches frequently run afoul of a TooManyClauses 
> exception.
> There
> is a bunch of information from people far more knowledgeable than me 
> in this mail archive if you search on "wildcard". It's a more 
> complicated matter than you might think at first.
>
> Note that the TooManyClauses exception very likely won't show up in 
> your test data sets. It won't show up until you put in a bunch of 
> data. See TooManyClauses in the javadoc.....
>
> I'd argue if you can create rules for wildcarding, you can create 
> rules for stemming By that I mean that if you have a programmatic way 
> to turn "Frederich" into "Fred", you can just index and search on 
> "Fred" equally easily and not have to deal with wildcard complexity.
>
> Of course, since I don't know what problem you're actually trying to 
> solve, this may be irrelevant....
>
> Best
> Erick
>
> On 11/7/06, Alice <al...@gmail.com> wrote:
> >
> > Hello!
> >
> > I am totally new to Lucene and I'm trying to use it with my web 
> > application.
> >
> >
> >
> > What I'm doing is reading a table from my database line by line and 
> > indexing the columns.
> >
> > I read the users data such as First Name, Last Name, Email and so on.
> >
> >
> >
> > I hava a field by column, such as: firstName="Frederich"
> lastName="Brown"
> >
> > My intetion is to make users find other users using name, lastName 
> > or email.
> >
> > What is the best way to do it?
> >
> > If the user enter "Fred Brown", as I use queryParser, this string is 
> > broken to "Fred" "Brown" and I search those tokens in every field.
> >
> >
> >
> > As "Brown" was indexed, the user's registry is found.
> >
> > But if the user enters "Fred", no user is found.
> >
> > Why is that? I thought it would return the user "Frederich Brown"
> either.
> >
> > Can someone help?
> >
> >
> >
> > Thanks!
> >
> >
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Help on search

Posted by Vladimir Olenin <VO...@cihi.ca>.

You might actually try to look for some 'names database' (similar to Wordnet). Someone has probably already compiled a list of english 'names' and their common short forms (eg, 'Vlad' for 'Vladimir', 'Fred' for 'Frederich', etc). Alternatevly, compile such DB yourself (and don't forget to publish it under OSS license afterwards!!!! :) ) - if there is none right now it really can be useful for quite a few applications.

Once you have the DB your can do similar things as with 'english synomym' DB (not sure what Lucene notations are for this) - lookup 'synonym' for the name and thus be able to search all forms.... Or use such DB for custom stemming, as Erick suggested.

Vlad

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: Tuesday, November 07, 2006 1:27 PM
To: java-user@lucene.apache.org
Subject: Re: Help on search

<<<How can I make a rule for stemming names?>>>

That's the problem all right. And I don't have any good answer. Wildcarding could work for you, but it's really the same question, isn't it? "How do you transform a typed name into a wildcard query?" is, I think, equivalent to your question above (assuming that wildcarding is really truncation. i.e.
Fred* and not Fred*i*). And if you can answer it, you can stem or wildcard......

Could a really simple rule like "only use the first 4 letters" work? Of course there are obvious names for which this wouldn't work. But is something like that "good enough"?

There is one other possibility, soundex/metaphone. There's a discussion in Lucene in Action about how to do this. The basic idea is that you index and search on words that "sound like" what's in the document. Whether "sounds like" makes Fred and Frederich equivalent is the unknown here.....

Not much help I know

Good luck
Erick

On 11/7/06, Alice <al...@gmail.com> wrote:
>
> Thank you for answering.
>
> I thought about stemming at index and search time, but Im dealing with 
> names.
> How can I make a rule for stemming names?
>
> I'm thinking about different names, I thought I could use Lucene to 
> help users find whomever they're looking for without having to spell 
> their friends names correctly or fully.
>
> That's why I gave the example about 'Frederich', it is very common 
> that users enter 'Fred'.
>
> I'm going to read about the wildcards.
>
> Anymore suggestions are very welcome.
>
> Thanks
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: terça-feira, 7 de novembro de 2006 15:12
> To: java-user@lucene.apache.org
> Subject: Re: Help on search
>
> You really have to get away from thinking about lucene like you would 
> a database, or you'll have endless problems <G>.
>
> If you want just a bag of name parts, why use separate fields? For 
> instance, index both firstname and lastname in a name field in each 
> document. Then all you have to do to find all documents that have 
> Brown in them is search on the single field name. There's no need to 
> construct complex or clauses, which is what I suspect you're doing 
> now.....
>
> You can also index firstname, lastname, and a third field "bagofnames"
> that
> contains both if you want.
>
> Vladimir is right, unless you take some actions you always get exact 
> matches when you search, so I'm not surprised that searching for Fred 
> does NOT return Frederich-containing documents.
>
> Whether wildcarding is your best strategy is debatable, it depends 
> upon your intent. You could consider stemming both at index and search 
> time.
> Generically, all that means is that you index "Fred" for "Frederich". 
> and at search time you search for "Fred". There are stemming analyzers 
> in Lucene, but whether they do what you want only you can decide. Be 
> aware that wildcard searches frequently run afoul of a TooManyClauses 
> exception.
> There
> is a bunch of information from people far more knowledgeable than me 
> in this mail archive if you search on "wildcard". It's a more 
> complicated matter than you might think at first.
>
> Note that the TooManyClauses exception very likely won't show up in 
> your test data sets. It won't show up until you put in a bunch of 
> data. See TooManyClauses in the javadoc.....
>
> I'd argue if you can create rules for wildcarding, you can create 
> rules for stemming By that I mean that if you have a programmatic way 
> to turn "Frederich" into "Fred", you can just index and search on 
> "Fred" equally easily and not have to deal with wildcard complexity.
>
> Of course, since I don't know what problem you're actually trying to 
> solve, this may be irrelevant....
>
> Best
> Erick
>
> On 11/7/06, Alice <al...@gmail.com> wrote:
> >
> > Hello!
> >
> > I am totally new to Lucene and I'm trying to use it with my web 
> > application.
> >
> >
> >
> > What I'm doing is reading a table from my database line by line and 
> > indexing the columns.
> >
> > I read the users data such as First Name, Last Name, Email and so on.
> >
> >
> >
> > I hava a field by column, such as: firstName="Frederich"
> lastName="Brown"
> >
> > My intetion is to make users find other users using name, lastName 
> > or email.
> >
> > What is the best way to do it?
> >
> > If the user enter "Fred Brown", as I use queryParser, this string is 
> > broken to "Fred" "Brown" and I search those tokens in every field.
> >
> >
> >
> > As "Brown" was indexed, the user's registry is found.
> >
> > But if the user enters "Fred", no user is found.
> >
> > Why is that? I thought it would return the user "Frederich Brown"
> either.
> >
> > Can someone help?
> >
> >
> >
> > Thanks!
> >
> >
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Help on search

Posted by Erick Erickson <er...@gmail.com>.

<<<How can I make a rule for stemming names?>>>

That's the problem all right. And I don't have any good answer. Wildcarding
could work for you, but it's really the same question, isn't it? "How do you
transform a typed name into a wildcard query?" is, I think, equivalent to
your question above (assuming that wildcarding is really truncation. i.e.
Fred* and not Fred*i*). And if you can answer it, you can stem or
wildcard......

Could a really simple rule like "only use the first 4 letters" work? Of
course there are obvious names for which this wouldn't work. But is
something like that "good enough"?

There is one other possibility, soundex/metaphone. There's a discussion in
Lucene in Action about how to do this. The basic idea is that you index and
search on words that "sound like" what's in the document. Whether "sounds
like" makes Fred and Frederich equivalent is the unknown here.....

Not much help I know

Good luck
Erick

On 11/7/06, Alice <al...@gmail.com> wrote:
>
> Thank you for answering.
>
> I thought about stemming at index and search time, but Im dealing with
> names.
> How can I make a rule for stemming names?
>
> I'm thinking about different names, I thought I could use Lucene to help
> users find whomever they're looking for without having to spell their
> friends names correctly or fully.
>
> That's why I gave the example about 'Frederich', it is very common that
> users enter 'Fred'.
>
> I'm going to read about the wildcards.
>
> Anymore suggestions are very welcome.
>
> Thanks
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: terça-feira, 7 de novembro de 2006 15:12
> To: java-user@lucene.apache.org
> Subject: Re: Help on search
>
> You really have to get away from thinking about lucene like you would a
> database, or you'll have endless problems <G>.
>
> If you want just a bag of name parts, why use separate fields? For
> instance,
> index both firstname and lastname in a name field in each document. Then
> all
> you have to do to find all documents that have Brown in them is search on
> the single field name. There's no need to construct complex or clauses,
> which is what I suspect you're doing now.....
>
> You can also index firstname, lastname, and a third field "bagofnames"
> that
> contains both if you want.
>
> Vladimir is right, unless you take some actions you always get exact
> matches
> when you search, so I'm not surprised that searching for Fred does NOT
> return Frederich-containing documents.
>
> Whether wildcarding is your best strategy is debatable, it depends upon
> your
> intent. You could consider stemming both at index and search time.
> Generically, all that means is that you index "Fred" for "Frederich". and
> at
> search time you search for "Fred". There are stemming analyzers in Lucene,
> but whether they do what you want only you can decide. Be aware that
> wildcard searches frequently run afoul of a TooManyClauses exception.
> There
> is a bunch of information from people far more knowledgeable than me in
> this
> mail archive if you search on "wildcard". It's a more complicated matter
> than you might think at first.
>
> Note that the TooManyClauses exception very likely won't show up in your
> test data sets. It won't show up until you put in a bunch of data. See
> TooManyClauses in the javadoc.....
>
> I'd argue if you can create rules for wildcarding, you can create rules
> for
> stemming By that I mean that if you have a programmatic way to turn
> "Frederich" into "Fred", you can just index and search on "Fred" equally
> easily and not have to deal with wildcard complexity.
>
> Of course, since I don't know what problem you're actually trying to
> solve,
> this may be irrelevant....
>
> Best
> Erick
>
> On 11/7/06, Alice <al...@gmail.com> wrote:
> >
> > Hello!
> >
> > I am totally new to Lucene and I'm trying to use it with my web
> > application.
> >
> >
> >
> > What I'm doing is reading a table from my database line by line and
> > indexing
> > the columns.
> >
> > I read the users data such as First Name, Last Name, Email and so on.
> >
> >
> >
> > I hava a field by column, such as: firstName="Frederich"
> lastName="Brown"
> >
> > My intetion is to make users find other users using name, lastName or
> > email.
> >
> > What is the best way to do it?
> >
> > If the user enter "Fred Brown", as I use queryParser, this string is
> > broken
> > to "Fred" "Brown" and I search those tokens in every field.
> >
> >
> >
> > As "Brown" was indexed, the user's registry is found.
> >
> > But if the user enters "Fred", no user is found.
> >
> > Why is that? I thought it would return the user "Frederich Brown"
> either.
> >
> > Can someone help?
> >
> >
> >
> > Thanks!
> >
> >
> >
> >
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Help on search

Posted by Dan Armbrust <da...@gmail.com>.

A few more google searches will probably turn up some reasonable lists 
of abbreviation rules or lists for common names - I found this right away:

(google cache link that converts pdf to html)

http://72.14.205.104/search?q=cache:dh7HGiQ-G4wJ:immigrants.byu.edu/Downloads/BritishNames.pdf+common+name+abbreviations&hl=en&gl=us&ct=clnk&cd=5

With a table such as this, you could write a tokenizer that would inject 
the abbreviated form of common names into your index in addition to the 
default form.

Or, you could index them in as an alternate field, then you would have 
more control at query time whether or not you wanted to match on 
abbreviations.

Dan


-- 
****************************
Daniel Armbrust
Biomedical Informatics
Mayo Clinic Rochester
daniel.armbrust(at)mayo.edu
http://informatics.mayo.edu/

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

RE: Help on search

Posted by Alice <al...@gmail.com>.

Thank you for answering.

I thought about stemming at index and search time, but Im dealing with
names.
How can I make a rule for stemming names? 

I'm thinking about different names, I thought I could use Lucene to help
users find whomever they're looking for without having to spell their
friends names correctly or fully.

Thats why I gave the example about 'Frederich', it is very common that
users enter 'Fred'.

I'm going to read about the wildcards.

Anymore suggestions are very welcome.

Thanks

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: terça-feira, 7 de novembro de 2006 15:12
To: java-user@lucene.apache.org
Subject: Re: Help on search

You really have to get away from thinking about lucene like you would a
database, or you'll have endless problems <G>.

If you want just a bag of name parts, why use separate fields? For instance,
index both firstname and lastname in a name field in each document. Then all
you have to do to find all documents that have Brown in them is search on
the single field name. There's no need to construct complex or clauses,
which is what I suspect you're doing now.....

You can also index firstname, lastname, and a third field "bagofnames" that
contains both if you want.

Vladimir is right, unless you take some actions you always get exact matches
when you search, so I'm not surprised that searching for Fred does NOT
return Frederich-containing documents.

Whether wildcarding is your best strategy is debatable, it depends upon your
intent. You could consider stemming both at index and search time.
Generically, all that means is that you index "Fred" for "Frederich". and at
search time you search for "Fred". There are stemming analyzers in Lucene,
but whether they do what you want only you can decide. Be aware that
wildcard searches frequently run afoul of a TooManyClauses exception. There
is a bunch of information from people far more knowledgeable than me in this
mail archive if you search on "wildcard". It's a more complicated matter
than you might think at first.

Note that the TooManyClauses exception very likely won't show up in your
test data sets. It won't show up until you put in a bunch of data. See
TooManyClauses in the javadoc.....

I'd argue if you can create rules for wildcarding, you can create rules for
stemming By that I mean that if you have a programmatic way to turn
"Frederich" into "Fred", you can just index and search on "Fred" equally
easily and not have to deal with wildcard complexity.

Of course, since I don't know what problem you're actually trying to solve,
this may be irrelevant....

Best
Erick

On 11/7/06, Alice <al...@gmail.com> wrote:
>
> Hello!
>
> I am totally new to Lucene and I'm trying to use it with my web
> application.
>
>
>
> What I'm doing is reading a table from my database line by line and
> indexing
> the columns.
>
> I read the users data such as First Name, Last Name, Email and so on.
>
>
>
> I hava a field by column, such as: firstName="Frederich" lastName="Brown"
>
> My intetion is to make users find other users using name, lastName or
> email.
>
> What is the best way to do it?
>
> If the user enter "Fred Brown", as I use queryParser, this string is
> broken
> to "Fred" "Brown" and I search those tokens in every field.
>
>
>
> As "Brown" was indexed, the user's registry is found.
>
> But if the user enters "Fred", no user is found.
>
> Why is that? I thought it would return the user "Frederich Brown" either.
>
> Can someone help?
>
>
>
> Thanks!
>
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Help on search

Posted by Erick Erickson <er...@gmail.com>.

You really have to get away from thinking about lucene like you would a
database, or you'll have endless problems <G>.

If you want just a bag of name parts, why use separate fields? For instance,
index both firstname and lastname in a name field in each document. Then all
you have to do to find all documents that have Brown in them is search on
the single field name. There's no need to construct complex or clauses,
which is what I suspect you're doing now.....

You can also index firstname, lastname, and a third field "bagofnames" that
contains both if you want.

Vladimir is right, unless you take some actions you always get exact matches
when you search, so I'm not surprised that searching for Fred does NOT
return Frederich-containing documents.

Whether wildcarding is your best strategy is debatable, it depends upon your
intent. You could consider stemming both at index and search time.
Generically, all that means is that you index "Fred" for "Frederich". and at
search time you search for "Fred". There are stemming analyzers in Lucene,
but whether they do what you want only you can decide. Be aware that
wildcard searches frequently run afoul of a TooManyClauses exception. There
is a bunch of information from people far more knowledgeable than me in this
mail archive if you search on "wildcard". It's a more complicated matter
than you might think at first.

Note that the TooManyClauses exception very likely won't show up in your
test data sets. It won't show up until you put in a bunch of data. See
TooManyClauses in the javadoc.....

I'd argue if you can create rules for wildcarding, you can create rules for
stemming By that I mean that if you have a programmatic way to turn
"Frederich" into "Fred", you can just index and search on "Fred" equally
easily and not have to deal with wildcard complexity.

Of course, since I don't know what problem you're actually trying to solve,
this may be irrelevant....

Best
Erick

On 11/7/06, Alice <al...@gmail.com> wrote:
>
> Hello!
>
> I am totally new to Lucene and I'm trying to use it with my web
> application.
>
>
>
> What I'm doing is reading a table from my database line by line and
> indexing
> the columns.
>
> I read the users data such as First Name, Last Name, Email and so on.
>
>
>
> I hava a field by column, such as: firstName="Frederich" lastName="Brown"
>
> My intetion is to make users find other users using name, lastName or
> email.
>
> What is the best way to do it?
>
> If the user enter "Fred Brown", as I use queryParser, this string is
> broken
> to "Fred" "Brown" and I search those tokens in every field.
>
>
>
> As "Brown" was indexed, the user's registry is found.
>
> But if the user enters "Fred", no user is found.
>
> Why is that? I thought it would return the user "Frederich Brown" either.
>
> Can someone help?
>
>
>
> Thanks!
>
>
>
>
>

RE: Help on search

Posted by Vladimir Olenin <VO...@cihi.ca>.

Search for 'Fred*' if I'm not mistaken...

-----Original Message-----
From: Alice [mailto:alicelista@gmail.com] 
Sent: Tuesday, November 07, 2006 11:51 AM
To: java-user@lucene.apache.org
Subject: Help on search

Hello!

I am totally new to Lucene and I'm trying to use it with my web
application.

What I'm doing is reading a table from my database line by line and
indexing the columns.

I read the users data such as First Name, Last Name, Email and so on.

I hava a field by column, such as: firstName="Frederich"
lastName="Brown"

My intetion is to make users find other users using name, lastName or
email.

What is the best way to do it?

If the user enter "Fred Brown", as I use queryParser, this string is
broken to "Fred" "Brown" and I search those tokens in every field.

As "Brown" was indexed, the user's registry is found.

But if the user enters "Fred", no user is found.

Why is that? I thought it would return the user "Frederich Brown"
either.

Can someone help?

Thanks!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org