Posted to java-user@lucene.apache.org by Srinivas Bharghav <sr...@gmail.com> on 2009/03/06 07:25:15 UTC

Using Lucene for user query parsing

I am trying to evaluate whether Lucene is the right candidate for the
problem at hand.

Say I have 3 indexes:

Index 1 has street names.
Index 2 has business names.
Index 3 has area names.

All these names can be single words or combinations of words, like
"woodward street" or "marks and spencers street".

Now the user enters a query such as "mc donalds woodward street kingston
precinct".

I have to parse this query and come up with the best possible match. The
problem is that I do not know which part of the query is the business name,
the area name, or the street name. The user may also give the terms in any
order, for example "kingston precinct mc donalds woodward street". There
might be spelling mistakes in the query, and the user might say "road" or
"lane" where the data says "street". I know that Lucene is the right
candidate for the synonym and spelling-mistake parts, but I am a bit hazy
about the query parsing part, that is, which index to search for which
terms. Any help is greatly appreciated.

Thanks,
Srini.

Re: Using Lucene for user query parsing

Posted by Shashi Kant <sk...@sloan.mit.edu>.

The BoW (bag-of-words) approach is simple and highly effective IMO. If you
want to get a bit fancy, you could also use a multi-field query (for example
via MultiFieldQueryParser) against the combined index.

Another brute-force approach would be to hit all 3 indexes and see which
ones come back with the highest score(s).
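
The brute-force idea above could be sketched roughly as follows. This is
plain Python rather than Lucene code, and the overlap score, the function
names, and the sample data are all assumptions made for illustration; real
Lucene scores are computed differently.

```python
def tokens(text):
    """Lowercase and split on whitespace; a stand-in for a real analyzer."""
    return text.lower().split()


def overlap_score(query, name):
    """Fraction of the name's tokens that also occur in the query."""
    query_terms = set(tokens(query))
    name_terms = tokens(name)
    if not name_terms:
        return 0.0
    hits = sum(1 for term in name_terms if term in query_terms)
    return hits / len(name_terms)


def best_per_index(query, indexes):
    """Search every index and keep its best (score, name) pair."""
    results = {}
    for index_name, names in indexes.items():
        results[index_name] = max((overlap_score(query, n), n) for n in names)
    return results


# Hypothetical sample data standing in for the three indexes.
INDEXES = {
    "street": ["woodward street", "marks and spencers street"],
    "business": ["mc donalds", "marks and spencers"],
    "area": ["kingston precinct", "riverside precinct"],
}

best = best_per_index("kingston precinct mc donalds woodward street", INDEXES)
for index_name, (score, name) in best.items():
    print(index_name, score, name)
```

Each index is searched with the whole query, so the order of the terms does
not matter. Scores from separate indexes are only roughly comparable, so
treat the per-index winners as candidates rather than as a final answer.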

On Mon, Mar 9, 2009 at 8:43 AM, Erick Erickson <er...@gmail.com> wrote:

> But you might try the bagowords I suggested before as
> a shortcut and see what kind of results you get. Sometimes
> simplistic solutions are "good enough", but that's always
> up to you to decide once you start seeing results.

Re: Using Lucene for user query parsing

Posted by Erick Erickson <er...@gmail.com>.

Sure, Lucene is suited. If....

The central problem here isn't the search engine, IMO, it's
figuring out what bits of the query are relevant to what
parts of the data. That is, in some random string, what is
the street, business name, address, etc.

Lucene has nothing built in that I know of that'll help with
that part. Once you *have* figured out what parts of the
query relate to what fields in your index, the rest is easy.
But you'll have to do the figuring out yourself.

But you might try the bagowords I suggested before as
a shortcut and see what kind of results you get. Sometimes
simplistic solutions are "good enough", but that's always
up to you to decide once you start seeing results.

Best
Erick

On Mon, Mar 9, 2009 at 4:31 AM, Srinivas Bharghav
<sr...@gmail.com> wrote:

> Thanks for all the inputs guys.
>
> As Erick said let me elaborate the problem a bit.

Re: Using Lucene for user query parsing

Posted by Srinivas Bharghav <sr...@gmail.com>.

Thanks for all the input, guys.

As Erick asked, let me elaborate on the problem a bit.

We are trying to develop a local search application. The user will be able
to locate businesses, localities, and roads, and we have data for all three.
We do not want to provide a separate box for each kind of input. Instead
there will be one common entry box (a la Google :)) where the user enters an
address (or a road name or an area name), or all three together. From the
user query we have to find the best possible match in our data. The data has
lots of numbers, as well as names with initials and the like. The user may
put a space between the initials or run them together. We have no way to
tell from the query what is what, apart from the obvious cases: if something
ends with "road" it is a road name, and if the query contains "layout" it is
an area. Right now we have our own custom framework. I am trying to figure
out whether Lucene is suited to this kind of application.

Once again, thanks for all the input.

On Fri, Mar 6, 2009 at 7:15 PM, Erick Erickson <er...@gmail.com> wrote:

> Whatever you do will be wrong <G>. What you're saying is
> that you have structured data that the user wants to search
> in an unstructured way, and you want to try to create a
> system that intuits what the user meant. Good luck <G>.

Re: Using Lucene for user query parsing

Posted by Erick Erickson <er...@gmail.com>.

Whatever you do will be wrong <G>. What you're saying is
that you have structured data that the user wants to search
in an unstructured way, and you want to try to create a
system that intuits what the user meant. Good luck <G>.

Can you back up a bit and talk about the problem you're
trying to solve? If, for instance, you're trying to find the
best match for a particular business, one approach would
be to create one index where each business had

street
business
area
bagowords

where the field bagowords contained a copy of the data
from the other three fields, then search bagowords
for your query terms. It sounds simplistic, but it might be
surprisingly good.
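
A minimal sketch of the bagowords idea, in plain Python rather than Lucene
code. The function names, the sample records, and the "count the matching
terms" ranking are assumptions made for illustration, not Lucene's scoring.

```python
def make_document(street, business, area):
    """Build one record per business, plus a catch-all 'bagowords' field."""
    return {
        "street": street,
        "business": business,
        "area": area,
        # Copy of the other three fields, searched without knowing
        # which query term belongs to which field.
        "bagowords": " ".join([street, business, area]).lower(),
    }


def search_bagowords(query, documents):
    """Rank documents by how many distinct query terms their bag contains."""
    query_terms = set(query.lower().split())
    scored = []
    for doc in documents:
        bag = set(doc["bagowords"].split())
        score = len(query_terms & bag)
        if score > 0:
            scored.append((score, doc["business"]))
    # Highest score first; ties broken alphabetically so output is stable.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical sample records.
DOCS = [
    make_document("woodward street", "mc donalds", "kingston precinct"),
    make_document("woodward street", "marks and spencers", "kingston precinct"),
    make_document("high street", "mc donalds", "riverside precinct"),
]

hits = search_bagowords("kingston precinct mc donalds woodward street", DOCS)
same = search_bagowords("mc donalds woodward street kingston precinct", DOCS)
print(hits)
```

Because the bag is just a set of terms, both orderings of the query return
the same ranking, and the record that matches all six terms comes out on top.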

And if this is out in left field, a higher level statement
of the problem would help get better answers.

Best
Erick

Re: Using Lucene for user query parsing

Posted by Anshum <an...@gmail.com>.

Hi Srinivas,

Perhaps what you need here is query-formation logic that assigns the
right keywords to the right fields. Let me know if I got that wrong. One
way to do it could be to use index-time boosts on fields and then run the
query, so that a match in one field is preferred over a match in another.
As far as I know, Lucene should be a better solution than anything else
I know of for such a thing, but there would be a few things that you
would have to build yourself as well.
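
To illustrate what preferring one field over another could look like, here
is a rough sketch in plain Python, not Lucene code. The boost values, the
function name, and the two sample records are hypothetical, and the weights
are applied at scoring time purely to keep the example short.

```python
# Hypothetical boosts: a match in the area field counts twice as much.
FIELD_BOOSTS = {"business": 1.0, "street": 1.0, "area": 2.0}


def boosted_score(query, doc, boosts=FIELD_BOOSTS):
    """Sum, over all fields, boost * number of query terms found in the field."""
    query_terms = set(query.lower().split())
    total = 0.0
    for field, boost in boosts.items():
        field_terms = set(doc[field].lower().split())
        total += boost * len(query_terms & field_terms)
    return total


# A business called Kingston on Mc Donalds Street...
doc_a = {"business": "kingston", "street": "mc donalds street",
         "area": "riverside precinct"}
# ...versus a Mc Donalds in Kingston Precinct.
doc_b = {"business": "mc donalds", "street": "woodward street",
         "area": "kingston precinct"}

query = "mc donalds kingston"
print(boosted_score(query, doc_a), boosted_score(query, doc_b))
```

With equal weights both records match three query terms and tie; with the
area field weighted higher, the record whose area is Kingston ranks first.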

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............

Re: Using Lucene for user query parsing

Posted by Vasudevan Comandur <vc...@gmail.com>.

You could have a single index with all the names tagged by type at indexing
time. For query parsing, you could keep a lookup of common trailing words
that identify business names (Corp, Inc, LLC, Ltd, etc.) and of common words
that identify addresses (road, avenue, street, lane, etc.), and split the
query terms at the appropriate places.
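
A rough sketch of that lookup-based splitting, in plain Python. The marker
sets are only the examples listed above, and the function name and segment
labels are made up for illustration.

```python
BUSINESS_MARKERS = {"corp", "inc", "llc", "ltd"}
ADDRESS_MARKERS = {"road", "avenue", "street", "lane"}


def split_query(query):
    """Cut the query after each marker word and label the segment."""
    segments = []
    current = []
    for term in query.lower().split():
        current.append(term)
        if term in BUSINESS_MARKERS:
            segments.append(("business", " ".join(current)))
            current = []
        elif term in ADDRESS_MARKERS:
            segments.append(("address", " ".join(current)))
            current = []
    if current:
        # No marker closed this run of terms, so its type stays unknown.
        segments.append(("unknown", " ".join(current)))
    return segments


parts = split_query("acme ltd woodward street kingston precinct")
no_marker = split_query("mc donalds woodward street kingston precinct")
print(parts)
print(no_marker)
```

The second query shows the limitation: "mc donalds" carries no business
marker, so it gets lumped into the address segment, and the area name is
left unlabeled.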

Another suggestion is to use OpenNLP components such as the POS tagger and
the NP chunker, which should give better results during query parsing.

Regards
 Vasu

Re: Using Lucene for user query parsing

Posted by Ian Lea <ia...@gmail.com>.

Can you not make one index with all three types of name and just
search that?  Sounds much easier.  You might get a few funnies like
business Kingston on McDonald's street, but they'd be the exception.

--
Ian.
