Re: Parsing postal addresses

cfc@world.std.com (Chris F Clark)
16 Oct 1997 00:25:02 -0400

          From comp.compilers


From: cfc@world.std.com (Chris F Clark)
Newsgroups: comp.compilers
Date: 16 Oct 1997 00:25:02 -0400
Organization: The World Public Access UNIX, Brookline, MA
References: 97-10-067
Keywords: parse

> I am looking for information on parsing postal addresses (especially
> US addresses). Specifically, I am looking for code and/or libraries to
> standardize addresses (in accordance with USPS rules).


I don't know of any publicly available sources that do that. Moreover,
I know that the problem is harder than one might think. We did some
work for a major shipping company to help them implement an address
parser in Yacc++, with a goal of only 90% accuracy (i.e. one address
in ten was not recognized). A basic skeleton for such a parser is
included as one of the examples, but it only handles a couple of
address forms and none of the complicated cases and is only minimally
better than no skeleton at all.


Anyway, there are two things which combine to make the problem hard:
errors in the input and the imprecise nature of names and addresses.
The source of input for the project we worked on was scanned-in
address labels, which introduced some errors that may not be present in
your source. However, the other source of errors, people using
non-standard address forms, is likely to be there in any case, unless
you have some way of accepting only restricted forms of addresses, such
as a windows dialog box.


The non-standard address forms are what combine with the imprecise
nature of names and addresses to increase the complexity. Take your
hypothetical difficult case, such as
            Della Street (name)
            California Avenue, Inc. (company name)
            10 New York NE Ave 5 (street address with apartment number)
            California, New Mexico 84562 (city, state, and zip)
and introduce an error in it, and see if the parser doesn't
misidentify one of the fields (or simply reject the address as
unparseable).
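
To make the ambiguity concrete, here is a minimal sketch (in Python, not
Yacc++, and not the parser described above) of two naive keyword rules
applied to the example address. The tables STATE_NAMES and
STREET_SUFFIXES and all function names are hypothetical and deliberately
tiny; they are assumptions for illustration only.

```python
import re

# Hypothetical, deliberately tiny keyword tables (not the USPS lists).
STATE_NAMES = {"california", "new mexico", "new york"}
STREET_SUFFIXES = {"street", "st", "avenue", "ave"}

def words(line):
    """Lower-case alphabetic tokens of a line."""
    return re.findall(r"[a-z]+", line.lower())

def mentions_state(line):
    """True if any one- or two-word window is a known state name."""
    toks = words(line)
    grams = set(toks) | {" ".join(pair) for pair in zip(toks, toks[1:])}
    return bool(grams & STATE_NAMES)

def mentions_street_suffix(line):
    """True if any token is a known street suffix."""
    return bool(set(words(line)) & STREET_SUFFIXES)

def candidates(lines, predicate):
    """Indices of the lines a keyword rule would accept."""
    return [i for i, line in enumerate(lines) if predicate(line)]

ADDRESS = [
    "Della Street",
    "California Avenue, Inc.",
    "10 New York NE Ave 5",
    "California, New Mexico 84562",
]

street_like = candidates(ADDRESS, mentions_street_suffix)
state_like = candidates(ADDRESS, mentions_state)
print("lines that look like a street line:", street_like)
print("lines that mention a state name:  ", state_like)
```

Each rule accepts three of the four lines, so keywords alone cannot
decide which line is which; position and context have to break the tie,
and those are exactly what an input error disturbs.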


My memory says that simply taking the USPS book and coding it in a
grammar takes a very short time (under a day) and gives you about 80%
accuracy over real-life addresses. At that point, you have to decide
what your priorities are for false-negatives (addresses which it says
are illegal but aren't) and false-positives (addresses which it says
are legal, but in which some fields contain incorrect information) and
how much time you are willing to invest in tuning your grammar.
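
One way to keep track of that trade-off while tuning is to score the
parser against a hand-labelled sample. The sketch below is only an
illustration under stated assumptions: parse_last_line is a toy regular
expression standing in for a real grammar, the SAMPLES list is made up,
and "false-negative" and "false-positive" are counted as defined above.

```python
import re

# Toy stand-in for a grammar-based parser: accepts only "City, ST 12345".
LAST_LINE = re.compile(
    r"^(?P<city>[A-Za-z .]+), (?P<state>[A-Z]{2}) (?P<zip>\d{5})$"
)

def parse_last_line(text):
    """Return a dict of fields, or None if the line is rejected."""
    m = LAST_LINE.match(text)
    return m.groupdict() if m else None

def score(samples):
    """Count outcomes over (text, expected_fields) pairs.

    Every sample is assumed to be a valid address line, so a rejection
    is a false negative and an accepted-but-wrong parse is a false
    positive.
    """
    counts = {"correct": 0, "false_negative": 0, "false_positive": 0}
    for text, expected in samples:
        got = parse_last_line(text)
        if got is None:
            counts["false_negative"] += 1
        elif got != expected:
            counts["false_positive"] += 1
        else:
            counts["correct"] += 1
    return counts

# Made-up labelled sample, for illustration only.
SAMPLES = [
    ("Hopkinton, MA 01748",
     {"city": "Hopkinton", "state": "MA", "zip": "01748"}),
    ("St. Louis, MO 63101",
     {"city": "St. Louis", "state": "MO", "zip": "63101"}),
    ("Hopkinton MA 01748",        # missing comma: rejected
     {"city": "Hopkinton", "state": "MA", "zip": "01748"}),
    ("Apt B Boston, MA 02108",    # stray unit: accepted, wrong city
     {"city": "Boston", "state": "MA", "zip": "02108"}),
]

counts = score(SAMPLES)
print(counts)
```

Loosening the pattern to accept the missing comma removes the false
negative but tends to admit more mis-split lines, which is the tuning
decision described above.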


The real weakness in such a system lies in the fact that most people
only deal with a limited number of addresses in their life. Thus,
they can apply the rules sloppily with no ill effect, since they are
only going to encounter a few exceptions to the rules, and those they
can memorize as special cases, often not even noticing that they are
violating the rules. I don't know of any way of capturing those
exceptions in advance.


Hope this helps,
-Chris


*****************************************************************************
Chris Clark                    Internet   :  compres@world.std.com
Compiler Resources, Inc.       CompuServe :  74252,1375
3 Proctor Street               voice      :  (508) 435-5016
Hopkinton, MA 01748  USA       fax        :  (508) 435-4847  (24 hours)

[You can try out the USPS' own address standardizer at
http://www.usps.gov/ncsc/lookups/lookup_zip+4.html, where there's also
a link to vendors of address handling software and service. It is my
impression that the parser isn't all that fancy but they have a rather
large set of patterns and synonyms. They also have a database of
every single valid postal address in the country and I expect they do
some fuzzy matching to get the best match for each address. -John]
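
The general idea of fuzzy matching against a table of known addresses
can be sketched as follows. This is not a description of how the USPS
standardizer works; it only illustrates the technique using Python's
standard difflib module. The KNOWN list, the normalize and best_match
helpers, and the 0.8 cutoff are all assumptions made for the example.

```python
import difflib

# Hypothetical stand-in for a database of valid, standardized addresses.
KNOWN = [
    "3 PROCTOR ST, HOPKINTON MA 01748",
    "10 MAIN ST, HOPKINTON MA 01748",
    "3 PROSPECT AVE, BROOKLINE MA 02146",
]

def normalize(text):
    """Upper-case and collapse runs of whitespace."""
    return " ".join(text.upper().split())

def best_match(raw, cutoff=0.8):
    """Return the closest known address, or None if nothing is close.

    The cutoff is an arbitrary illustrative threshold; raising it
    trades more rejections for fewer wrong matches.
    """
    hits = difflib.get_close_matches(normalize(raw), KNOWN, n=1, cutoff=cutoff)
    return hits[0] if hits else None

print(best_match("3 Procter Street,  Hopkinton MA 01748"))
print(best_match("totally different text"))
```

A misspelled street name and an unabbreviated suffix still resolve to
the first entry, while unrelated text falls below the cutoff and is
rejected.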

