Posted to java-user@lucene.apache.org by Pavel Minchenkov <ch...@gmail.com> on 2010/12/15 16:28:59 UTC

Custom scoring for searching geographic objects

Hi,
Please advise me on how to create custom scoring. I need the resulting
documents to be ordered by how popular each term in the document is
(popular = how many times it appears in the index) and by the length of the
document (fewer terms = higher in the search results).

For example, the index contains the following data:

ID    | SEARCH_FIELD
------------------------------
1     | Russia
2     | Russia, Moscow
3     | Russia, Volgograd
4     | Russia, Ivanovo
5     | Russia, Ivanovo, Altayskaya street 45
6     | Russia, Moscow, Kremlin
7     | Russia, Moscow, Altayskaya street
8     | Russia, Moscow, Altayskaya street 15
9     | Russia, Moscow, Altayskaya street 15/26


And I should get the following results:


Query      | Document result set
----------------------------------------------
Russia     | 1,2,4,3,6,7,8,9,5
Moscow     | 2,6,7,8,9
Ivanovo    | 4,5
Altayskaya | 7,8,9,5

In fact, this is a search for geographic objects (cities, streets, houses).
Only part of the address may be given, and the most relevant results
should be returned.

Thanks.
-- 
Pavel Minchenkov
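
The expected result sets above are consistent with one simple rule, sketched
here in plain Java rather than Lucene code: sort the matching documents by
token count (fewest first), then by the summed popularity of their terms
(highest first). The class name GeoRanking, the comma/whitespace
tokenization, and the final tie-break on document ID are assumptions made
for illustration; none of them come from the original message.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Plain-Java model of the requested ordering; hypothetical names, not Lucene code.
final class GeoRanking {
    // Sample index from the message: ID -> SEARCH_FIELD.
    static final Map<Integer, String> DOCS = new LinkedHashMap<>();
    static {
        DOCS.put(1, "Russia");
        DOCS.put(2, "Russia, Moscow");
        DOCS.put(3, "Russia, Volgograd");
        DOCS.put(4, "Russia, Ivanovo");
        DOCS.put(5, "Russia, Ivanovo, Altayskaya street 45");
        DOCS.put(6, "Russia, Moscow, Kremlin");
        DOCS.put(7, "Russia, Moscow, Altayskaya street");
        DOCS.put(8, "Russia, Moscow, Altayskaya street 15");
        DOCS.put(9, "Russia, Moscow, Altayskaya street 15/26");
    }

    // Assumption: lowercase and split on commas/whitespace, so "15/26" is one token.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[,\\s]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // Popularity of a term = how many times it appears in the whole index.
    static Map<String, Integer> termCounts() {
        Map<String, Integer> counts = new HashMap<>();
        for (String text : DOCS.values()) {
            for (String t : tokens(text)) counts.merge(t, 1, Integer::sum);
        }
        return counts;
    }

    // Summed popularity of all terms in one document.
    static int popularity(int id, Map<String, Integer> counts) {
        int sum = 0;
        for (String t : tokens(DOCS.get(id))) sum += counts.get(t);
        return sum;
    }

    // IDs of the documents containing the term, best match first.
    static List<Integer> search(String term) {
        Map<String, Integer> counts = termCounts();
        String q = term.toLowerCase(Locale.ROOT);
        List<Integer> hits = new ArrayList<>();
        for (Map.Entry<Integer, String> e : DOCS.entrySet()) {
            if (tokens(e.getValue()).contains(q)) hits.add(e.getKey());
        }
        hits.sort(Comparator
                .comparingInt((Integer id) -> tokens(DOCS.get(id)).size()) // fewer terms first
                .thenComparingInt(id -> -popularity(id, counts))            // more popular first
                .thenComparingInt(id -> id));                               // assumed tie-break
        return hits;
    }

    public static void main(String[] args) {
        for (String q : new String[] {"Russia", "Moscow", "Ivanovo", "Altayskaya"}) {
            System.out.println(q + " -> " + search(q));
        }
    }
}
```

With the nine sample documents, GeoRanking.search("Russia") returns
1,2,4,3,6,7,8,9,5, the order requested in the table above.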

Re: Custom scoring for searching geographic objects

Posted by Doron Cohen <cd...@gmail.com>.
Also, when following the Similarity suggestion below, note two aspects of
Lucene's default behavior that you seem to want to avoid:

The first is IDF, which matters only for multi-term queries; otherwise
ignore this comment.
For multi-term queries to consider only term frequency and doc length, you
may want to always return 1 for idf() in your Similarity impl (otherwise
terms appearing in more documents will contribute less to the score, which
you seem to want to avoid).

The second is the inaccuracy of doc length normalization: because doc
lengths are encoded lossily, at search time Lucene might not distinguish
between two documents whose lengths are almost the same. To address this,
your Similarity impl for lengthNorm(), which is applied at indexing time,
could return e.g. 1/(10 * numTokens), reducing the chances that two docs of
different lengths end up with the same search-time norm.

Doron
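
As a rough illustration of the two adjustments, here is a plain-Java model
of the arithmetic only. It is not Lucene's API: the class FlatIdfScoring and
the simplified score (sum of tf * idf * lengthNorm) are assumptions for this
sketch. In a real index the two methods would correspond to the idf() and
lengthNorm() overrides in a Similarity implementation, as described above.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the two adjustments; hypothetical names, not Lucene's API.
final class FlatIdfScoring {
    // Adjustment 1: constant idf, so a term does not contribute less
    // just because it appears in more documents.
    static float idf(String term) {
        return 1.0f;
    }

    // Adjustment 2: length norm of 1 / (10 * numTokens), as suggested above.
    static float lengthNorm(int numTokens) {
        return 1.0f / (10.0f * numTokens);
    }

    // Simplified score (an assumption, not Lucene's full formula):
    // sum over query terms of tf * idf * lengthNorm.
    static float score(List<String> queryTerms, List<String> docTokens) {
        if (docTokens.isEmpty()) {
            return 0.0f; // avoid dividing by zero for an empty document
        }
        float norm = lengthNorm(docTokens.size());
        float total = 0.0f;
        for (String term : queryTerms) {
            int tf = 0;
            for (String token : docTokens) {
                if (token.equals(term)) tf++;
            }
            total += tf * idf(term) * norm;
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> query = Arrays.asList("russia", "moscow");
        System.out.println(score(query, Arrays.asList("russia", "moscow")));
        System.out.println(score(query, Arrays.asList("russia", "moscow", "kremlin")));
    }
}
```

In this model the two-token document scores higher than the three-token
one for the same query, because only the length norm differs between them.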

On Wed, Dec 15, 2010 at 5:43 PM, Ian Lea <ia...@gmail.com> wrote:

> Sounds to me that lucene should do a pretty good job without any extra
> work on your part.  See javadocs for
> org.apache.lucene.search.Similarity
> for details on how it works.  You can change things by providing your
> own implementation.
>
> There is also the org.apache.lucene.search.function package but that
> is much more complex.
>
>
> A web search for "lucene scoring" should find you lots of info.
>
>
> --
> Ian.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Custom scoring for searching geographic objects

Posted by Ian Lea <ia...@gmail.com>.
It sounds to me like Lucene should do a pretty good job without any extra
work on your part.  See the javadocs for
org.apache.lucene.search.Similarity
for details on how it works.  You can change things by providing your
own implementation.

There is also the org.apache.lucene.search.function package, but that
is much more complex.


A web search for "lucene scoring" should find you lots of info.


--
Ian.





Re: Custom scoring for searching geographic objects

Posted by Alexey Serba <as...@gmail.com>.
Hi Pavel,

I had a similar problem several years ago - I had to find
geographical locations in textual descriptions, geocode these objects
to lat/long during the indexing process, and allow users to filter/sort
search results by specific geographical areas. The important issue was
that there were several types of geographical objects - street < town
< region < country. The idea was to geocode to the narrowest
geographical area possible. Relevance logic in this case could be
specified as "find the narrowest result that is uniquely identified by
your text or search query".  So I came up with a custom algorithm that
was quite good in terms of performance and precision/recall. Here's
a simple description:
* Intersect all text/search query terms with a locations
dictionary to find only the geo terms.
* Search your locations Lucene index and filter to street objects only
(the narrowest areas). Thanks to the tf*idf formula you'll get the most
relevant results first. Then post-process the top N (3/5/10) results and
verify that they are indeed matches. I intersected the search terms with
each result's terms and ran another Lucene search to verify that these
terms uniquely identify the match. If they do, return the matching street.
If there is no match, proceed using the same algorithm with towns,
regions, and countries.

HTH,
Alexey
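
The cascade described above could be sketched roughly as follows. This is
an in-memory model with no Lucene calls, not the original implementation:
the class NarrowestMatch, the sample locations, ranking by the number of
shared terms (as a stand-in for the tf*idf search), and checking uniqueness
only among locations of the same type are all assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// In-memory model of the cascade; hypothetical names, no Lucene calls.
final class NarrowestMatch {
    enum Level { STREET, TOWN, REGION, COUNTRY } // narrowest first

    static final class Location {
        final String name;
        final Level level;
        final Set<String> terms;

        Location(String name, Level level, String... terms) {
            this.name = name;
            this.level = level;
            this.terms = new HashSet<>(Arrays.asList(terms));
        }
    }

    // Made-up sample data for illustration.
    static List<Location> sampleIndex() {
        return Arrays.asList(
            new Location("Altayskaya street, Moscow", Level.STREET, "altayskaya", "street", "moscow", "russia"),
            new Location("Altayskaya street, Ivanovo", Level.STREET, "altayskaya", "street", "ivanovo", "russia"),
            new Location("Tverskaya street, Moscow", Level.STREET, "tverskaya", "street", "moscow", "russia"),
            new Location("Moscow", Level.TOWN, "moscow", "russia"),
            new Location("Ivanovo", Level.TOWN, "ivanovo", "russia"),
            new Location("Russia", Level.COUNTRY, "russia"));
    }

    // Number of terms a location shares with the given term set.
    static int overlap(Location l, Set<String> terms) {
        Set<String> shared = new HashSet<>(l.terms);
        shared.retainAll(terms);
        return shared.size();
    }

    static Optional<Location> resolve(List<String> textTerms, List<Location> index, int topN) {
        // Step 1: keep only terms that occur in the locations dictionary.
        Set<String> dictionary = new HashSet<>();
        for (Location l : index) dictionary.addAll(l.terms);
        Set<String> geoTerms = new HashSet<>(textTerms);
        geoTerms.retainAll(dictionary);

        for (Level level : Level.values()) {
            // Step 2: rank this level's candidates by shared terms (stand-in for tf*idf).
            List<Location> candidates = new ArrayList<>();
            for (Location l : index) {
                if (l.level == level && overlap(l, geoTerms) > 0) candidates.add(l);
            }
            candidates.sort(Comparator.comparingInt((Location l) -> -overlap(l, geoTerms)));

            // Step 3: accept a top-N candidate only if the shared terms identify it uniquely.
            for (Location c : candidates.subList(0, Math.min(topN, candidates.size()))) {
                Set<String> shared = new HashSet<>(c.terms);
                shared.retainAll(geoTerms);
                long matches = index.stream()
                        .filter(l -> l.level == level && l.terms.containsAll(shared))
                        .count();
                if (matches == 1) return Optional.of(c);
            }
            // Otherwise fall through to the next, wider level.
        }
        return Optional.empty();
    }
}
```

With the sample data, the terms "altayskaya street moscow" resolve to the
Moscow street, "moscow" alone falls back to the town, and "altayskaya
street" alone resolves to nothing because two streets share those terms.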





Re: Custom scoring for searching geographic objects

Posted by Grant Ingersoll <gs...@apache.org>.
Have a look at http://lucene.apache.org/java/3_0_2/scoring.html for how Lucene's scoring works.  You can also override the Similarity class in Solr via the schema.xml file.


--------------------------
Grant Ingersoll
http://www.lucidimagination.com



