Posted to general@lucene.apache.org by janis <ja...@yahoo.com> on 2009/01/02 06:56:41 UTC

Need help on a Lucene problem

Hi there,
 
I am working on a web-based job search application using Lucene. A user on
my site can search for jobs within a radius of 100 miles of, say,
"Boston, MA" or any other location.
I also need to show the search results sorted by relevance (i.e. the score
returned by Lucene) in descending order.

I'm using a 3rd-party API to fetch all the cities within a given radius of a
city. This API returns around 864 cities within a 100-mile radius of
"Boston, MA".

I'm building the city/state Lucene query using the following logic, which is
part of my "BuildNearestCitiesQuery" method.
Here nearestCities is a hashtable returned by the above API. It contains 864
cities, with CityName as key and StateCode as value.
finalQuery is a Lucene BooleanQuery object which contains the other search
criteria entered by the user, such as skills, keywords, etc.

/*code*/
foreach (string city in nearestCities.Keys)
{
    cityStateQuery = new BooleanQuery();

    queryCity = queryParserCity.Parse(city);
    queryState = queryParserState.Parse(((string[])nearestCities[city])[1]);

    // MUST is like an AND: both the city and the state have to match
    cityStateQuery.Add(queryCity, BooleanClause.Occur.MUST);
    cityStateQuery.Add(queryState, BooleanClause.Occur.MUST);

    // SHOULD is like an OR: any one of the city/state pairs may match.
    // This has to happen inside the loop, otherwise only the last pair is added.
    nearestCityQuery.Add(cityStateQuery, BooleanClause.Occur.SHOULD);
}

// the job must be in at least one of the nearby cities
finalQuery.Add(nearestCityQuery, BooleanClause.Occur.MUST);
/*code*/

 


I then pass the finalQuery object to Lucene's Search method to get all the
jobs within a 100-mile radius:

searcher.Search(finalQuery, collector);

 

I found that this BuildNearestCitiesQuery method takes a whopping 29 seconds
on average to execute, which is obviously unacceptable for a website. I also
found that the statements involving "Parse" take considerably longer to
execute than the other statements.

The set of jobs for a given location is dynamic, in the sense that a city
could have 2 jobs (meeting particular search criteria) today, but zero jobs
for the same search criteria 3 days later. So I cannot use any caching here.

Is there any way I can optimize this logic? Or, for that matter, my whole
approach/algorithm for finding all jobs within 100 miles using Lucene?
 

FYI, here is what my indexing in Lucene looks like:

 

doc.Add(new Field("jobId", job.JobID.ToString().Trim(), Field.Store.YES,
Field.Index.UN_TOKENIZED));

doc.Add(new Field("title", job.JobTitle.Trim(), Field.Store.YES,
Field.Index.TOKENIZED));

doc.Add(new Field("description", job.JobDescription.Trim(), Field.Store.NO,
Field.Index.TOKENIZED));

doc.Add(new Field("city", job.City.Trim(), Field.Store.YES,
Field.Index.TOKENIZED , Field.TermVector.YES));

doc.Add(new Field("state", job.StateCode.Trim(), Field.Store.YES,
Field.Index.TOKENIZED, Field.TermVector.YES));

doc.Add(new Field("citystate", job.City.Trim() + ", " +
job.StateCode.Trim(), Field.Store.YES, Field.Index.UN_TOKENIZED ,
Field.TermVector.YES));

doc.Add(new Field("datePosted", jobPostedDateTime, Field.Store.YES,
Field.Index.UN_TOKENIZED));

doc.Add(new Field("company", job.HiringCoName.Trim(), Field.Store.YES,
Field.Index.TOKENIZED));

doc.Add(new Field("jobType", job.JobTypeID.ToString(), Field.Store.NO,
Field.Index.UN_TOKENIZED,Field.TermVector.YES));

doc.Add(new Field("sector", job.SectorID.ToString(), Field.Store.NO,
Field.Index.UN_TOKENIZED, Field.TermVector.YES));

doc.Add(new Field("showAllJobs", "yy", Field.Store.NO,
Field.Index.UN_TOKENIZED));


Thanks a ton for reading! I would really appreciate your help on this.
 
 
Janis


Re: Need help on a Lucene problem

Posted by André Warnier <aw...@ice-sa.com>.
janis wrote:
> 
> Is there any way I can optimize this logic?or for that matter my whole
> approach/algorithm towards finding all jobs within 100 miles using Lucene?
>  
Hi.
I don't know how Lucene works per se.
But think about what you are really doing with your logic:
You are telling the search engine to

- look (in the whole database) for all items which have city = city-1, 
and keep a list of these item numbers
- look for all items which have city = city-2, and keep a list of these 
item numbers
...
- look for all items which have city = city-864, and keep a list of 
these item numbers

- now combine all the item numbers above, and return a list of the 
unique item numbers among them

- look for all the items that have state = state-1, and keep a list..
- look for all ... state-2, and keep a list...
...
- now combine all these items and return a list of the unique item 
numbers among them

- now combine the list from the cities, with the list from the states, 
and return a list of all unique item numbers among them

- look for all items which have skill = skill-1, and keep a list
...
... etc..

If your database contains 1,000,000 job items, no wonder it is taking 29 
seconds.

You would be much better off doing a first query with the criteria that are 
the most restrictive (i.e. that will probably give the fewest hits), then 
applying another query to that result set to get a smaller set, then 
applying another query to that set to restrict it even further, etc.
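
A minimal sketch of that narrowing idea, in plain Java and without any 
Lucene calls. The names Job, Criterion, estimatedHits and narrow are made up 
for this illustration, and the hit estimates are assumed to come from 
somewhere else (for example, counts you already know). It only shows the 
order of operations: most restrictive criterion first, each later criterion 
applied to the shrinking result set.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Predicate;

class NarrowingSearch {
    /** Hypothetical job record; only the fields this sketch filters on. */
    static final class Job {
        final int id;
        final String city;
        final String skill;

        Job(int id, String city, String skill) {
            this.id = id;
            this.city = city;
            this.skill = skill;
        }
    }

    /** One search criterion plus a rough guess of how many jobs satisfy it. */
    static final class Criterion {
        final long estimatedHits;
        final Predicate<Job> matches;

        Criterion(long estimatedHits, Predicate<Job> matches) {
            this.estimatedHits = estimatedHits;
            this.matches = matches;
        }
    }

    /** Applies the criteria one after another, fewest expected hits first. */
    static List<Job> narrow(List<Job> jobs, List<Criterion> criteria) {
        // Sort a copy so the caller's list is left untouched.
        List<Criterion> ordered = new ArrayList<>(criteria);
        ordered.sort(Comparator.comparingLong((Criterion c) -> c.estimatedHits));

        List<Job> current = jobs;
        for (Criterion c : ordered) {
            List<Job> next = new ArrayList<>();
            for (Job job : current) {
                if (c.matches.test(job)) {
                    next.add(job);
                }
            }
            current = next;
            if (current.isEmpty()) {
                break; // nothing left to restrict further
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job(1, "Boston", "java"));
        jobs.add(new Job(2, "Boston", "perl"));
        jobs.add(new Job(3, "Austin", "java"));

        List<Criterion> criteria = new ArrayList<>();
        criteria.add(new Criterion(2, j -> j.city.equals("Boston")));
        criteria.add(new Criterion(1, j -> j.skill.equals("perl")));

        for (Job job : narrow(jobs, criteria)) {
            System.out.println(job.id + " " + job.city + " " + job.skill);
        }
    }
}
```

Whether this beats a single combined query depends on how the search engine 
evaluates queries internally, so it is worth timing both.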

Another aspect is that search engines like Lucene are the right tool to 
use when you are searching for words which occur in a text, in relative 
position to each other, and/or after stemming etc.
But they are not necessarily the best tool to use when you are looking 
for a strict (aka "stupid") string comparison, such as ' city == "New 
York" ', where the city name is in a field of its own and is in a fixed 
(predictable) form. (I mean that to search for "New York" you can just 
compare with the string "New York" and you do not have to do a query like 
"the word New next to the word York".)
For example, since you already have your 864 city names in a table, in a 
known form, and since your items all have a field "city" in a known 
form, you could use Lucene to do the query without the city criteria, get 
the list of results in an array, and then do a simple scan of that array in 
Java, keeping only the items that match one of your cities of choice 
(string comparison).  The same goes for the state.
With 10,000 results and 864 cities, using Perl this would probably take 
less than a second. Your mileage with Java may vary.
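
A minimal sketch of such a scan, in plain Java. The Hit class, the key 
helper and keepNearby are made up for this illustration; Hit stands in for 
whatever you read back from the stored "city" and "state" fields of each 
result. One assumption beyond the text above: the 864 city/state pairs are 
put into a hash set first, so that each result needs one lookup instead of 
864 string comparisons. The scan keeps the results in their incoming order, 
so a list that arrives sorted by score stays sorted by score.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class CityFilter {
    /** Hypothetical holder for one search result; only the fields needed here. */
    static final class Hit {
        final String jobId;
        final String city;
        final String state;
        final float score;

        Hit(String jobId, String city, String state, float score) {
            this.jobId = jobId;
            this.city = city;
            this.state = state;
            this.score = score;
        }
    }

    /** Builds one case-insensitive lookup key from a city and a state code. */
    static String key(String city, String state) {
        return city.trim().toLowerCase(Locale.ROOT)
                + "|" + state.trim().toLowerCase(Locale.ROOT);
    }

    /** Keeps only the hits whose city/state pair is in the set, in input order. */
    static List<Hit> keepNearby(List<Hit> hits, Set<String> nearbyKeys) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (nearbyKeys.contains(key(hit.city, hit.state))) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // In the real application this set would hold all 864 pairs.
        Set<String> nearby = new HashSet<>();
        nearby.add(key("Boston", "MA"));
        nearby.add(key("Cambridge", "MA"));

        List<Hit> hits = new ArrayList<>();
        hits.add(new Hit("1", "Boston", "MA", 0.9f));
        hits.add(new Hit("2", "Austin", "TX", 0.8f));
        hits.add(new Hit("3", "cambridge", "ma", 0.5f));

        for (Hit hit : keepNearby(hits, nearby)) {
            System.out.println(hit.jobId + " " + hit.city + ", " + hit.state);
        }
    }
}
```

The catch is that the query without the city criteria may return far more 
results than the 10,000 assumed above, and all of them have to be read back 
before the scan can start.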