
Posted to java-user@lucene.apache.org by Colin Young <Co...@tembizi.com> on 2006/01/27 17:09:17 UTC

Help with indexing and query strategy

I'm having some trouble coming up with a good search strategy for geographical data. For example, given these records:
 
[1] city: London, United Kingdom
[2] city: London, Ontario, Canada
[3] city: Ontario, California, United States
[4] state: Ontario, Canada
[5] city: Vancouver, Washington, United States
[6] city: Vancouver, British Columbia, Canada
[7] city: Washington, DC, United States
[8] state: Washington, United States
 
and also given the following synonyms:
 
Ontario = ON
California = CA
Washington = WA
Canada = CA
United States = US = America = United States of America
United Kingdom = UK = Great Britain = England
 
For each of the following queries, I want the number of hits shown in parentheses, coming from the records shown in square brackets:
 
i. Ontario (2) [3, 4]
ii. London (2) [1, 2]
iii. Ontario, Canada (1) [4]
iv. Ontario, California (1) [3]
v. Ontario, CA (2) [3, 4]
vi. Ontario, US (1) [3]
vii. Vancouver (2) [5, 6]
viii. Washington (2) [7, 8]
ix. Washington, DC (1) [7]
x. Vancouver, CA (1) [6]
xi. Vancouver, WA (1) [5]
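
One possible reading of the expected results above (an inference from the table, not something the message states) is that the first query term must name the entity itself, while each later term must name one of its enclosing regions, in order, with regions optionally skipped. The plain-Java sketch below only checks that reading against the eight records. It is not Lucene code and not a proposed index layout, and `AnchoredMatchSketch`, `names`, `anchoredMatch` and `hits` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AnchoredMatchSketch {
    // Synonym groups taken from the message.
    static final String[] US = {"United States", "US", "America", "United States of America"};
    static final String[] UK = {"United Kingdom", "UK", "Great Britain", "England"};
    static final String[] CANADA = {"Canada", "CA"};

    /** Records [1]..[8]: one array of alternatives per component, entity name first. */
    static Map<Integer, String[][]> records() {
        Map<Integer, String[][]> r = new LinkedHashMap<>();
        r.put(1, new String[][] {{"London"}, UK});
        r.put(2, new String[][] {{"London"}, {"Ontario", "ON"}, CANADA});
        r.put(3, new String[][] {{"Ontario"}, {"California", "CA"}, US});
        r.put(4, new String[][] {{"Ontario", "ON"}, CANADA});
        r.put(5, new String[][] {{"Vancouver"}, {"Washington", "WA"}, US});
        r.put(6, new String[][] {{"Vancouver"}, {"British Columbia"}, CANADA});
        r.put(7, new String[][] {{"Washington"}, {"DC"}, US});
        r.put(8, new String[][] {{"Washington", "WA"}, US});
        return r;
    }

    /** True if the term equals any alternative name or code of one component. */
    static boolean names(String[] alternatives, String term) {
        for (String alt : alternatives) {
            if (alt.equalsIgnoreCase(term)) return true;
        }
        return false;
    }

    /**
     * Assumed rule: query[0] must name the entity itself (component 0);
     * the remaining terms must name later components in order, skips allowed.
     */
    static boolean anchoredMatch(String[][] components, String[] query) {
        if (query.length == 0 || !names(components[0], query[0])) return false;
        int matched = 1;
        for (int i = 1; i < components.length && matched < query.length; i++) {
            if (names(components[i], query[matched])) matched++;
        }
        return matched == query.length;
    }

    /** Record numbers that satisfy the assumed rule for the given query terms. */
    static List<Integer> hits(String... query) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String[][]> e : records().entrySet()) {
            if (anchoredMatch(e.getValue(), query)) ids.add(e.getKey());
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(hits("Ontario", "CA")); // prints [3, 4]
    }
}
```

Under this reading, `hits("Ontario", "CA")` returns records 3 and 4 but not 2, because in record 2 "Ontario" is an enclosing region rather than the entity's own name.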
 
How do I index and store the input so that I get the desired results? (Assume that I know the mechanics, so I'm not looking for specific Java syntax or how to generate synonyms during analysis.)

My current attempt indexes strings like "London Ontario Canada", "London ON Canada", "London Ontario CA" and "London ON CA" (i.e. every combination of entity name and corresponding code) in a content field, and creates a type field containing "city", "state" or "country" as appropriate to identify the type of entity being indexed. It then uses a phrase query with a slop of 1. This works really well except for queries such as "Ontario CA", for which I'd like 2 hits but, given the above data, get 3 (from 2, 3 and 4). The problem will only get worse as I add more cities in Ontario, since each one results in another hit.

The slop of 1 is required because not all countries customarily use states, and I need to support the user optionally dropping the state, as in the above example of "Ontario, CA", where we don't know whether the user intended "CA" to mean the state of California or the country of Canada, whereas "London, UK" would be unambiguous.
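
A rough sketch of the indexing attempt described above, in plain Java rather than Lucene, may make the 3-hit result easier to see. It expands each record into every combination of name and code, then applies a deliberately simplified stand-in for a sloppy phrase query: terms must appear in order, with at most `slop` positions skipped. Two assumptions are mine rather than the message's: multi-word names such as "United States" are treated as a single token, and the matching rule only approximates Lucene's real phrase scoring. `SloppyPhraseSketch`, `variants`, `sloppyMatch` and `hits` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SloppyPhraseSketch {
    // Synonym groups from the message; multi-word names are kept as ONE token (assumption).
    static final String[] US = {"United States", "US", "America", "United States of America"};
    static final String[] UK = {"United Kingdom", "UK", "Great Britain", "England"};
    static final String[] CANADA = {"Canada", "CA"};

    /** Records [1]..[8]: one array of alternatives per component, entity name first. */
    static Map<Integer, String[][]> records() {
        Map<Integer, String[][]> r = new LinkedHashMap<>();
        r.put(1, new String[][] {{"London"}, UK});
        r.put(2, new String[][] {{"London"}, {"Ontario", "ON"}, CANADA});
        r.put(3, new String[][] {{"Ontario"}, {"California", "CA"}, US});
        r.put(4, new String[][] {{"Ontario", "ON"}, CANADA});
        r.put(5, new String[][] {{"Vancouver"}, {"Washington", "WA"}, US});
        r.put(6, new String[][] {{"Vancouver"}, {"British Columbia"}, CANADA});
        r.put(7, new String[][] {{"Washington"}, {"DC"}, US});
        r.put(8, new String[][] {{"Washington", "WA"}, US});
        return r;
    }

    /** Expands one record into every combination of its components' alternatives. */
    static List<List<String>> variants(String[][] components) {
        List<List<String>> out = new ArrayList<>();
        out.add(new ArrayList<>());
        for (String[] alternatives : components) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : out) {
                for (String alt : alternatives) {
                    List<String> extended = new ArrayList<>(prefix);
                    extended.add(alt);
                    next.add(extended);
                }
            }
            out = next;
        }
        return out;
    }

    /** Simplified sloppy phrase: query terms in order, at most `slop` skipped positions. */
    static boolean sloppyMatch(List<String> doc, String[] query, int slop) {
        if (query.length == 0) return false;
        for (int start = 0; start < doc.size(); start++) {
            if (!doc.get(start).equalsIgnoreCase(query[0])) continue;
            int pos = start;   // position of the last matched term
            int matched = 1;   // number of query terms matched so far
            for (int i = start + 1; i < doc.size() && matched < query.length; i++) {
                if (doc.get(i).equalsIgnoreCase(query[matched])) {
                    pos = i;
                    matched++;
                }
            }
            int skipped = (pos - start) - (query.length - 1);
            if (matched == query.length && skipped <= slop) return true;
        }
        return false;
    }

    /** Record numbers with at least one variant matching the query at slop 1. */
    static List<Integer> hits(String... query) {
        List<Integer> ids = new ArrayList<>();
        for (Map.Entry<Integer, String[][]> e : records().entrySet()) {
            for (List<String> variant : variants(e.getValue())) {
                if (sloppyMatch(variant, query, 1)) {
                    ids.add(e.getKey());
                    break;
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        System.out.println(hits("Ontario", "CA")); // prints [2, 3, 4]
    }
}
```

In this model, record 2 is hit because its variant "London Ontario CA" contains "Ontario CA" as an exact phrase, so the extra hit does not depend on the slop at all. Every additional city indexed under Ontario would produce the same kind of variant.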
 
The major problem as I see it is that at parse time I don't know if the user is searching for a city, state or country, and I don't want to force them to specify that.
 
Does anyone have any good ideas to help me solve this problem?
 
Thanks.
 
Colin Young
 


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Help with indexing and query strategy

Posted by Rajesh Munavalli <ra...@gmail.com>.

Hi Colin,
Even assuming you come up with a good way of indexing, the example
query "Ontario, CA" should yield 3 hits: records 2, 3 and 4 are all
valid retrievals. Could you please explain which 2 hits you want, and why?

Thanks,

Rajesh Munavalli



