Posted to general@lucene.apache.org by janis <ja...@yahoo.com> on 2009/01/02 06:56:41 UTC
Need help on a Lucene problem
Hi there,
I am working on a web-based job search application using Lucene. Users on my site
can search for jobs within a radius of 100 miles of, say,
"Boston, MA" or any other location.
Also, I need to show the search results sorted by "relevance" (i.e. the score
returned by Lucene) in descending order.
I'm using a third-party API to fetch all the cities within a given radius of a
city. This API returns around 864 cities within a 100-mile radius of
"Boston, MA".
I'm building the city/state Lucene query using the following logic, which is
part of my "BuildNearestCitiesQuery" method.
Here nearestCities is a hashtable returned by the above API. It contains 864
cities, with CityName as key and StateCode as value.
And finalQuery is a Lucene BooleanQuery object which contains the other search
criteria entered by the user, such as skills, keywords, etc.
/*code*/
foreach (string city in nearestCities.Keys)
{
    BooleanQuery tempFinalQuery = finalQuery;
    cityStateQuery = new BooleanQuery();
    queryCity = queryParserCity.Parse(city);
    queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);
    // MUST is like an AND: a job has to match both the city and its state
    cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST);
    cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);
    // SHOULD is like an OR: a job may match any one of the city/state pairs
    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD);
}
// the combined location clause is required in the final query
finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);
/*code*/
I then pass the finalQuery object to Lucene's Search method to get all the jobs
within a 100-mile radius:
searcher.Search(finalQuery, collector);
I found that this BuildNearestCitiesQuery method takes a whopping 29 seconds
on average to execute, which is obviously unacceptable for a website.
I also found that the statements involving "Parse" take a
considerable amount of time compared to the other statements.
The jobs for a given location are dynamic, in the sense that a city
could have 2 jobs (meeting particular search criteria) today, but zero jobs
for the same search criteria 3 days later. So I cannot use any caching
here.
Is there any way I can optimize this logic? Or, for that matter, my whole
approach/algorithm for finding all jobs within 100 miles using Lucene?
FYI, here is what my indexing in Lucene looks like:
doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES,
    Field.Index.UN_TOKENIZED));
doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES,
    Field.Index.TOKENIZED));
doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO,
    Field.Index.TOKENIZED));
doc.Add(new Field("city", job.City.Trim(), Field.Store.YES,
    Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES,
    Field.Index.TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("citystate", job.City.Trim() + ", " + job.StateCode.Trim(),
    Field.Store.YES, Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES,
    Field.Index.UN_TOKENIZED));
doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES,
    Field.Index.TOKENIZED));
doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO,
    Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO,
    Field.Index.UN_TOKENIZED, Field.TermVector.YES));
doc.Add(new Field("showAllJobs", "yy", Field.Store.NO,
    Field.Index.UN_TOKENIZED));
Thanks a ton for reading! I would really appreciate your help on this.
Janis
--
View this message in context: http://www.nabble.com/Need-help-on-a-Lucene-problem-tp21248342p21248342.html
Sent from the Lucene - General mailing list archive at Nabble.com.
Re: Need help on a Lucene problem
Posted by André Warnier <aw...@ice-sa.com>.
janis wrote:
>
> Is there any way I can optimize this logic? Or for that matter my whole
> approach/algorithm towards finding all jobs within 100 miles using Lucene?
>
Hi.
I don't know how Lucene works per se.
But think about what you are really doing with your logic:
You are telling the search engine to
- look (in the whole database) for all items which have city = city-1,
and keep a list of these item numbers
- look for all items which have city = city-2, and keep a list of these
item numbers
...
- look for all items which have city = city-864, and keep a list of
these item numbers
- now combine all the item numbers above, and return a list of the
unique item numbers among them
- look for all the items that have state = state-1, and keep a list..
- look for all ... state-2, and keep a list...
...
- now combine all these items and return a list of the unique item
numbers among them
- now combine the list from the cities, with the list from the states,
and return a list of all unique item numbers among them
- look for all items which have skill = skill-1, and keep a list
...
... etc..
If your database contains 1,000,000 job items, no wonder it is taking 29
seconds.
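To make the steps above concrete, here is a toy model of that list-combining logic in plain Java. It only illustrates the description above and is not a statement of how Lucene actually evaluates a query; the class name, the `unionOf` helper and the sample item numbers are all made up for the illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class PostingListModel {

    /** Unions the item-number lists of every requested term into one set of unique item numbers. */
    static Set<Integer> unionOf(Map<String, List<Integer>> postings, List<String> terms) {
        Set<Integer> result = new TreeSet<>();
        for (String term : terms) {
            List<Integer> items = postings.get(term);
            if (items != null) {
                result.addAll(items); // one lookup and one merge per term
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical data: term -> item numbers containing that term.
        Map<String, List<Integer>> cityPostings = new HashMap<>();
        cityPostings.put("boston", Arrays.asList(1, 4, 7));
        cityPostings.put("cambridge", Arrays.asList(2, 4));

        Map<String, List<Integer>> skillPostings = new HashMap<>();
        skillPostings.put("java", Arrays.asList(2, 3, 7));

        // OR over the cities, OR over the skills ...
        Set<Integer> inCities = unionOf(cityPostings, Arrays.asList("boston", "cambridge"));
        Set<Integer> withSkill = unionOf(skillPostings, Arrays.asList("java"));

        // ... then AND the two sets: keep only items present in both.
        inCities.retainAll(withSkill);
        System.out.println(inCities);
    }
}
```

With 864 cities, the loop in `unionOf` runs 864 times for the city part alone, which is the point being made about the amount of work involved.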
You would be much better off running a first query with the
criteria that are the most restrictive (i.e. those likely to give the
fewest hits), then applying another query to that result set to get
a smaller set, then applying another query to that set to restrict it
even further, etc.
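As a rough sketch of this narrowing idea, assuming the candidate items are already in memory: each filter is applied only to the survivors of the previous one, so the most restrictive filter should come first. The `narrow` helper and the sample strings are hypothetical, and this is plain Java rather than anything from the Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

public class NarrowingSearch {

    /** Applies each filter in turn to the survivors of the previous filter. */
    static <T> List<T> narrow(List<T> items, List<Predicate<T>> filtersMostRestrictiveFirst) {
        List<T> current = items;
        for (Predicate<T> filter : filtersMostRestrictiveFirst) {
            List<T> next = new ArrayList<>();
            for (T item : current) {
                if (filter.test(item)) {
                    next.add(item);
                }
            }
            current = next;
            if (current.isEmpty()) {
                break; // nothing left, so later filters need not run
            }
        }
        return current;
    }

    public static void main(String[] args) {
        // Hypothetical items in the form "title|city, state".
        List<String> jobs = Arrays.asList(
                "java developer|Boston, MA",
                "java developer|Austin, TX",
                "nurse|Boston, MA");

        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(job -> job.startsWith("java"));       // assumed to be the most restrictive
        filters.add(job -> job.endsWith("Boston, MA"));   // applied only to the survivors

        System.out.println(narrow(jobs, filters));
    }
}
```

The later a filter runs, the fewer items it has to examine, which is why the order matters.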
Another aspect is that search engines like Lucene are the right tool to
use when you are searching for words which occur in a text, in relative
position to each other, and/or after stemming etc.
But they are not necessarily the best tool to use when you are looking
for a strict (aka "stupid") string comparison, such as ' city == "New
York" ', where the city name is in a field of its own and is in a fixed
(predictable) form. (I mean that to search "New York" you can just
compare the string "New York" and you do not have to do a query like
"the word New next to the word York").
For example, since you already have your 864 city names in a table, in a
known form, and since your items all have a field "city" in a known
form, you could use Lucene to do the query excluding the city, get the
list of results in an array, and then do a simple scan of your array in
Java, keeping only the items that match one of your cities of choice
(string comparison). The same for the State.
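A minimal sketch of that post-filtering step, assuming the hits from the city-less query have already been copied into a list: the `JobHit` class, the `filterByCity` helper and the sample data are hypothetical and not part of Lucene or of the original application. Putting the city names in a hash set makes each check a constant-time lookup on average, instead of a scan over all 864 names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class CityPostFilter {

    /** Hypothetical holder for one search hit; not a Lucene class. */
    static final class JobHit {
        final String jobId;
        final String cityState; // assumed to be in the same form as the nearby-city list, e.g. "Boston, MA"
        final float score;

        JobHit(String jobId, String cityState, float score) {
            this.jobId = jobId;
            this.cityState = cityState;
            this.score = score;
        }
    }

    /** Keeps only the hits whose city/state is in the allowed set, preserving the input order. */
    static List<JobHit> filterByCity(List<JobHit> hits, Set<String> allowedCityStates) {
        List<JobHit> kept = new ArrayList<>();
        for (JobHit hit : hits) {
            if (allowedCityStates.contains(hit.cityState)) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the 864 cities returned by the radius lookup.
        Set<String> nearby = new HashSet<>(
                Arrays.asList("Boston, MA", "Cambridge, MA", "Nashua, NH"));

        // Stand-in for the hits of the query without the city criteria.
        List<JobHit> hits = Arrays.asList(
                new JobHit("101", "Boston, MA", 0.92f),
                new JobHit("102", "Austin, TX", 0.88f),
                new JobHit("103", "Nashua, NH", 0.75f));

        for (JobHit hit : filterByCity(hits, nearby)) {
            System.out.println(hit.jobId + " " + hit.cityState + " " + hit.score);
        }
    }
}
```

Because the filter preserves the input order, hits that arrive sorted by score stay sorted by score after filtering.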
With 10,000 results and 864 cities, using Perl this would probably take
less than a second. Your mileage with Java may vary.