Posted to user@lucenenet.apache.org by Ron Grabowski <ro...@yahoo.com> on 2009/11/03 02:13:52 UTC

I want to index mailing addresses...how can I map Ave to Avenue, St to Street, Ct to Court, etc.?

I'm looking to index mailing addresses. I'd like to take into account these common abbreviations:

 http://www.usps.com/ncsc/lookups/usps_abbreviations.html

Would those be considered synonyms? I'm not sure whether I should use the WordNet modules or extend a built-in analyzer and append my own filter.

Has someone (in Java or .NET) already written a mailing address analyzer that handles normalizing things like "163 N 4th St" into "163 North Fourth Street"...if that's even a good thing to do?


Re: I want to index mailing addresses...how can I map Ave to Avenue, St to Street, Ct to Court, etc.?

Posted by Ron Grabowski <ro...@yahoo.com>.
Sounds like you wrote a synonym token filter?

 http://www.codeproject.com/KB/cs/lucene_custom_analyzer.aspx

Looks like that's the way to go:

 http://www.mail-archive.com/lucene-user@jakarta.apache.org/msg12581.html

Same technique mentioned in Lucene in Action 4.2.1:

"
A token with a zero position increment places the token in the same position as the previous token. 
Analyzers that inject synonyms can use a position increment of zero for the synonyms. The effect is that 
phrase queries work regardless of which synonym was used in the query. See our SynonymAnalyzer in 
section 4.6 for an example that uses position increments of zero.
"
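To make the quoted passage concrete, here is a small Python sketch, independent of Lucene, of why a zero position increment lets a phrase query match whichever synonym was typed. The helpers `positions` and `phrase_matches` are hypothetical names for illustration only, not Lucene APIs; Lucene does the equivalent work inside its index and phrase query code.

```python
from collections import defaultdict

def positions(tokens):
    """Map each term to the set of absolute positions it occupies.

    `tokens` is a list of (term, position_increment) pairs. An increment
    of 0 keeps the term at the same position as the previous token.
    """
    index = defaultdict(set)
    pos = -1
    for term, increment in tokens:
        pos += increment
        index[term].add(pos)
    return index

def phrase_matches(index, phrase):
    """Return True if the terms of `phrase` occur at consecutive positions."""
    terms = phrase.split()
    if not terms:
        return False
    return any(
        all(start + i in index.get(term, ()) for i, term in enumerate(terms))
        for start in index.get(terms[0], ())
    )

# "1st" is stacked on "first" and "st" on "street" (increment 0).
stream = [("first", 1), ("1st", 0), ("street", 1), ("st", 0)]
index = positions(stream)

for query in ("first street", "1st st", "first st", "1st street"):
    print(query, phrase_matches(index, query))
```

Because the stacked terms share a position, all four spellings line up as consecutive positions, while a reversed phrase such as "street first" does not.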

Hopefully I can just slurp up the data from the usps.com page. That along with "first -> 1st" should get me far enough.
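As a rough sketch of what that lookup data might look like once loaded, the Python below builds a two-way synonym table from (long form, abbreviation) pairs. The pairs are only the ones mentioned in this thread, and `build_synonyms` is a hypothetical helper name; the real list would come from the USPS page, whose format I am not assuming anything about here.

```python
from collections import defaultdict

# Pairs mentioned in this thread only; a real table would be much larger.
PAIRS = [
    ("avenue", "ave"),
    ("street", "st"),
    ("court", "ct"),
    ("first", "1st"),
    ("saint", "st"),
]

def build_synonyms(pairs):
    """Return {term: sorted list of equivalent terms}, in both directions."""
    table = defaultdict(set)
    for long_form, short_form in pairs:
        long_form, short_form = long_form.lower(), short_form.lower()
        table[long_form].add(short_form)
        table[short_form].add(long_form)
    return {term: sorted(others) for term, others in table.items()}

SYNONYMS = build_synonyms(PAIRS)
print(SYNONYMS["ave"])  # long form for "ave"
print(SYNONYMS["st"])   # "st" is ambiguous: saint or street
```

One thing the table makes visible is that "st" maps to both "saint" and "street", so an abbreviation-to-long-form rewrite cannot always pick a single answer; stacking every candidate at the same position is one way around that.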



----- Original Message ----
From: John Anderson <jo...@gmail.com>
To: lucene-net-user@incubator.apache.org
Sent: Wed, November 4, 2009 8:17:28 PM
Subject: Re: I want to index mailing addresses...how can I map Ave to Avenue,  St to Street, Ct to Court, etc.?

I have done something similar for city names.  The approach I took:

1) Identify a finite set of mappings.  For example, "first"->"1st",
"street"->"st", "road"->"rd".  This is a labor-intensive process, but you
can start small and add more over time.  (In my case the mappings were:
"fort"->"ft", "saint"->"st", etc.)

2) Use a custom TokenFilter that inserts extra tokens into the stream.  Use
setPositionIncrement(0) to 'stack' terms in the same position.  Put the
mapping terms from step 1 into a dictionary.  If you see "First Street" in
the data, add an extra token "1st" at the same position as "First" and an
extra token "st" at the same position as "street".  Now any of the following
phrase queries will match: "1st st", "1st street", "first st", "first
street".

Hope that helps...



Re: I want to index mailing addresses...how can I map Ave to Avenue, St to Street, Ct to Court, etc.?

Posted by John Anderson <jo...@gmail.com>.
I have done something similar for city names.  The approach I took:

1) Identify a finite set of mappings.  For example, "first"->"1st",
"street"->"st", "road"->"rd".  This is a labor-intensive process, but you
can start small and add more over time.  (In my case the mappings were:
"fort"->"ft", "saint"->"st", etc.)

2) Use a custom TokenFilter that inserts extra tokens into the stream.  Use
setPositionIncrement(0) to 'stack' terms in the same position.  Put the
mapping terms from step 1 into a dictionary.  If you see "First Street" in
the data, add an extra token "1st" at the same position as "First" and an
extra token "st" at the same position as "street".  Now any of the following
phrase queries will match: "1st st", "1st street", "first st", "first
street".
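The two steps above can be sketched in a few lines of Python. This is not Lucene code: `stack_synonyms` is a hypothetical stand-in for the custom TokenFilter, and the (term, position_increment) pairs it yields stand in for Lucene tokens whose increment would be set with setPositionIncrement(0). The dictionary holds only the example mappings from step 1.

```python
# Example mappings from step 1; a real dictionary would grow over time.
SYNONYMS = {"first": ["1st"], "street": ["st"], "road": ["rd"]}

def stack_synonyms(terms, synonyms):
    """Yield (term, position_increment) pairs.

    Each original term advances the position by 1. Any mapped terms are
    emitted right after it with an increment of 0, so they are 'stacked'
    at the same position as the original.
    """
    for term in terms:
        term = term.lower()
        yield term, 1
        for alternative in synonyms.get(term, ()):
            yield alternative, 0

tokens = list(stack_synonyms("First Street".split(), SYNONYMS))
print(tokens)
# [('first', 1), ('1st', 0), ('street', 1), ('st', 0)]
```

Terms without a mapping pass through unchanged, so the filter can sit at the end of an existing analyzer chain and the dictionary can start small.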

Hope that helps...

On Mon, Nov 2, 2009 at 8:45 PM, Ron Grabowski <ro...@yahoo.com> wrote:

> I have a database of company names and their addresses. The user will have
> already matched at the city and state level. I want to match the more
> free-form, user-entered company name and address to one of the known mailing
> addresses. My thought was to create something that normalizes the various
> ways of writing "1st St", "First St", "First Street" to help with matching.
> Eventually I might look into using geocoding but my initial thought was to
> get things as self-correcting as possible using Lucene without adding
> another layer of potentially slow matching (the geocoding stuff).

Re: I want to index mailing addresses...how can I map Ave to Avenue, St to Street, Ct to Court, etc.?

Posted by Ron Grabowski <ro...@yahoo.com>.
I have a database of company names and their addresses. The user will have already matched at the city and state level. I want to match the more free-form, user-entered company name and address to one of the known mailing addresses. My thought was to create something that normalizes the various ways of writing "1st St", "First St", and "First Street" to help with matching. Eventually I might look into geocoding, but my initial thought was to get things as self-correcting as possible using Lucene without adding another layer of potentially slow matching (the geocoding stuff).



----- Original Message ----
From: Robert Taintor <ro...@gmail.com>
To: lucene-net-user@incubator.apache.org
Sent: Mon, November 2, 2009 8:18:52 PM
Subject: Re: I want to index mailing addresses...how can I map Ave to Avenue,  St to Street, Ct to Court, etc.?

It might be better to use geocoding, depending on your use case.  I know
there is a Lucene spatial indexer that will let you search "nearby", but you
could also just index the geocode and use that to retrieve the record.



Re: I want to index mailing addresses...how can I map Ave to Avenue, St to Street, Ct to Court, etc.?

Posted by Robert Taintor <ro...@gmail.com>.
It might be better to use geocoding, depending on your use case.  I know
there is a Lucene spatial indexer that will let you search "nearby", but you
could also just index the geocode and use that to retrieve the record.

On Mon, Nov 2, 2009 at 8:13 PM, Ron Grabowski <ro...@yahoo.com> wrote:

> I'm looking to index mailing addresses. I'd like to take into account these
> common abbreviations:
>
>  http://www.usps.com/ncsc/lookups/usps_abbreviations.html
>
> Would those be considered synonyms? I'm not sure whether I should use
> the WordNet modules or extend a built-in analyzer and append my own filter.
>
> Has someone (in Java or .NET) already written a mailing address analyzer
> that handles normalizing things like "163 N 4th St" into "163 North Fourth
> Street"...if that's even a good thing to do?
>
>