Title: | Clean, Parse, Harmonize, Match, and Geocode Messy Real-World Addresses |
---|---|
Description: | Addresses that were not validated at the time of collection are often heterogeneously formatted, making them difficult to compare or link to other sets of addresses. The addr package is designed to clean character strings of addresses, use the `usaddress` library to tag address components, and paste together select components to create a normalized address. Normalized addresses can be hashed to create hashdresses that can be used to merge with other sets of addresses. |
Authors: | Cole Brokamp [aut, cre], Erika Manning [aut] |
Maintainer: | Cole Brokamp <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.6.0 |
Built: | 2025-01-10 15:32:09 UTC |
Source: | https://github.com/geomarker-io/addr |
An addr vector is created by converting messy, real-world mailing addresses in a character vector into a list of standardized address tags that behaves like a vector.
addr() (and as_addr()) vectors are a list of address tags under the hood, constructed by tagging address components using addr_tag() and combining them into specific fields:
street_number: AddressNumber
street_name: StreetNamePreType, StreetNamePreDirectional, StreetName
street_type: StreetNamePostType, StreetNamePostDirectional
city: PlaceName
state: StateName
zip_code: ZipCode
addr(
  x = character(),
  clean_address_text = TRUE,
  expand_street_type = TRUE,
  abbrev_cardinal_dir = TRUE,
  clean_zip_code = TRUE
)

as_addr(x, ...)
x |
a character vector of address strings |
clean_address_text |
logical; use clean_address_text() to clean addresses prior to tagging? |
expand_street_type |
logical; use expand_post_type() to expand abbreviated street types? |
abbrev_cardinal_dir |
logical; abbreviate cardinal directions? (e.g., "west" -> "w") |
clean_zip_code |
logical; remove any non-digit (or hyphen) characters and truncate tagged ZIP Code to 5 characters? |
... |
used to pass arguments in as_addr() to addr() |
In addition to the cleaning steps described in the arguments, the street number is coerced to a numeric after removing non-numeric characters.
See addr_tag() for details on address component tagging.
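The ZIP Code and street number cleaning steps described above can be sketched in base R. This is only an illustration of the described behavior, not the package's code: the helper names clean_zip_sketch() and street_number_sketch() are hypothetical, and the exact regular expressions used by addr() are an assumption.

```r
# Hypothetical helper: keep only digits and hyphens, then truncate to
# the first 5 characters (assumed reading of `clean_zip_code = TRUE`).
clean_zip_sketch <- function(zip) {
  substr(gsub("[^0-9-]", "", zip), 1, 5)
}

# Hypothetical helper: drop non-numeric characters, then coerce to numeric.
# An input with no digits becomes NA.
street_number_sketch <- function(x) {
  as.numeric(gsub("[^0-9]", "", x))
}

clean_zip_sketch(c("45229-3039", "OH 45229"))
street_number_sketch(c("3333", "3333A", "#224"))
```

Truncating after removing stray characters means a ZIP+4 value collapses to its five-digit ZIP Code, which is the level at which most reference address sets can be compared.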
In the case of an address having more than one word for a tag (e.g., "Riva Ridge" for StreetName), these words are concatenated, separated by a space, in the order they appeared in the address.
Compared to using addr(), as_addr() processes input character strings such that parsing is done once per unique input, usually speeding up address parsing in real-world datasets where address strings are often duplicated across observations.
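The idea of parsing once per unique input can be sketched in base R with unique() and match(). This is a minimal illustration rather than the implementation of as_addr(): slow_parse() is a hypothetical stand-in for an expensive parser, and parse_once_per_unique() is a hypothetical helper name.

```r
# Hypothetical stand-in for an expensive per-address parser; it upper-cases
# its input and counts how many times it has been called.
n_calls <- 0
slow_parse <- function(address) {
  n_calls <<- n_calls + 1
  toupper(address)
}

# Parse each unique input once, then map the results back onto the
# original positions so the output has the same length and order as `x`.
parse_once_per_unique <- function(x, parse_fun) {
  ux <- unique(x)
  parsed <- vapply(ux, parse_fun, character(1), USE.NAMES = FALSE)
  parsed[match(x, ux)]
}

x <- c("3333 Burnet Ave", "1324 Burnet Ave", "3333 Burnet Ave", "3333 Burnet Ave")
out <- parse_once_per_unique(x, slow_parse)
n_calls  # number of parser calls: one per unique address
```

With four inputs but only two distinct strings, the parser runs twice; the savings grow with the amount of duplication in the data.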
as_addr(c("3333 Burnet Ave Cincinnati OH 45229", "1324 Burnet Ave Cincinnati OH 45229"))
String distances are calculated between each addr in x and the addrs in a reference addr vector (ref_addr).
A list of matching reference addr vectors with optimal string alignment distances less than or equal to the specified threshold is returned.
See stringdist::stringdist-metrics for more details on string metrics and the optimal string alignment (osa) method.
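To make the optimal string alignment (OSA) distance concrete, here is a small base R sketch of the metric. It is for illustration only: the package relies on stringdist for this calculation, osa_distance() is a hypothetical helper, and treating a match as "distance of at most 1" is an assumption about how the "osa_lt_1" option is thresholded.

```r
# Hypothetical helper: OSA distance between two single strings.
# OSA is Levenshtein distance plus transposition of adjacent characters,
# where no substring may be edited more than once.
osa_distance <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  n <- length(a)
  m <- length(b)
  d <- matrix(0L, n + 1, m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(a[i] != b[j])
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L, # deletion
        d[i + 1, j] + 1L, # insertion
        d[i, j] + cost    # substitution (or match when cost is 0)
      )
      # adjacent transposition
      if (i > 1 && j > 1 && a[i] == b[j - 1] && a[i - 1] == b[j]) {
        d[i + 1, j + 1] <- min(d[i + 1, j + 1], d[i - 1, j - 1] + 1L)
      }
    }
  }
  d[n + 1, m + 1]
}

osa_distance("burnet", "brunet")   # one transposition
osa_distance("burnet", "burnett")  # one insertion
osa_distance("burnet", "burnet") <= 1  # assumed match rule
```

A threshold of one edit tolerates a single typo in a street name (a swapped, missing, extra, or wrong character) while still rejecting clearly different names.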
addr_match(
  x,
  ref_addr,
  stringdist_match = c("osa_lt_1", "exact"),
  match_street_type = TRUE,
  simplify = TRUE
)

addr_match_street_name_and_number(
  x,
  ref_addr,
  stringdist_match = c("osa_lt_1", "exact"),
  match_street_type = TRUE,
  simplify = TRUE
)

addr_match_street(
  x,
  ref_addr,
  stringdist_match = c("osa_lt_1", "exact"),
  match_street_type = TRUE
)
x |
an addr vector to match |
ref_addr |
an addr vector to search for matches in |
stringdist_match |
method for determining string match of street name: "osa_lt_1" requires an optimal string alignment distance less than or equal to 1; "exact" requires an exact match |
match_street_type |
logical; require street type to be identical to match? |
simplify |
logical; randomly select one addr from multi-matches and return an addr() vector instead of a list? (empty addr vectors and NULL values are converted to NA) |
for addr_match() and addr_match_street_name_and_number(), a named list of possible addr matches for each addr in x
for addr_match_street(), a list of possible addr matches for each addr in x (as ref_addr indices)
addr(c("3333 Burnet Ave Cincinnati OH 45229", "5130 RAPID RUN RD CINCINNATI OHIO 45238")) |>
  addr_match(cagis_addr()$cagis_addr)

addr(c("3333 Burnet Ave Cincinnati OH 45229", "5130 RAPID RUN RD CINCINNATI OHIO 45238")) |>
  addr_match(cagis_addr()$cagis_addr, simplify = FALSE) |>
  tibble::enframe(name = "input_addr", value = "ca") |>
  dplyr::mutate(ca = purrr::list_c(ca)) |>
  dplyr::left_join(cagis_addr(), by = c("ca" = "cagis_addr")) |>
  tidyr::unnest(cols = c(cagis_addr_data)) |>
  dplyr::select(-ca, -cagis_address)
Matching to reference geographies is attempted using different methods, tried in the order listed below, which are associated with decreasing levels of precision.
Each method generates matched s2 cell identifiers differently, and the method used is recorded in the match_method column of the returned tibble:
ref_addr: reference s2 cell from direct match to reference address
tiger_range: centroid of street-matched TIGER address ranges containing street number
tiger_street: centroid of street-matched TIGER address ranges closest to the street number
none: unmatched using all previous approaches; return missing s2 cell identifier
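The ordered fallback described above can be sketched in base R as "take the first method that produced a result". This is a simplified illustration, not the code of addr_match_geocode(): the per-method results are made up, and the cell labels are placeholders rather than real s2 cell identifiers.

```r
# Hypothetical per-method results for four addresses; NA means the method
# found no match. Columns are ordered from most to least precise.
candidates <- cbind(
  ref_addr     = c("cell_A", NA,       NA,       NA),
  tiger_range  = c("cell_B", "cell_C", NA,       NA),
  tiger_street = c(NA,       "cell_D", "cell_E", NA)
)

# Index of the first method with a non-missing result for each address.
first_hit <- apply(!is.na(candidates), 1, function(ok) {
  if (any(ok)) which(ok)[1] else NA_integer_
})

# Record which method was used, with "none" for unmatched addresses.
match_method <- factor(
  ifelse(is.na(first_hit), "none", colnames(candidates)[first_hit]),
  levels = c("ref_addr", "tiger_range", "tiger_street", "none")
)

# Pick the corresponding cell; rows with no hit stay NA.
s2 <- candidates[cbind(seq_len(nrow(candidates)), first_hit)]

data.frame(s2 = s2, match_method = match_method)
```

Keeping match_method alongside the result lets downstream analyses filter or stratify by geocoding precision.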
addr_match_geocode(
  x,
  ref_addr = cagis_addr()$cagis_addr,
  ref_s2,
  county = "39061",
  year = "2022"
)
x |
an addr vector (or character vector of address strings) to geocode |
ref_addr |
an addr vector to search for matches in |
ref_s2 |
a s2_cell vector of locations for each ref_addr |
county |
character county identifier for TIGER street range files to search for matches in |
year |
character year for TIGER street range files to search for matches in |
Performance was compared to the DeGAUSS geocoder (see /inst/compare_geocoding_to_degauss.R) using real-world addresses in voter_addresses().
Match success rates were similar, but DeGAUSS matched about 5% more of the addresses. These differences are sensitive to the match criteria considered for DeGAUSS (here precision of 'range' & score > 0.7 or precision of 'street' & score > 0.55):
addr_matched | degauss_matched | n | perc |
TRUE | TRUE | 224714 | 92.8% |
FALSE | TRUE | 13407 | 5.5% |
FALSE | FALSE | 2993 | 1.2% |
TRUE | FALSE | 1019 | 0.4% |
Among those that were geocoded by both, 97.9% were geocoded to the same census tract, and 96.6% to the same block group:
ct_agree | bg_agree | n | s2_dist_ptiles (5th, 25th, 50th, 75th, 95th) | perc |
TRUE | TRUE | 217179 | 14.7, 24.3, 39, 68.9, 153.6 | 96.6% |
FALSE | FALSE | 4805 | 21.6, 39.2, 158.9, 5577.9, 16998.8 | 2.1% |
TRUE | FALSE | 2730 | 19.6, 28.6, 41.2, 94.8, 571.8 | 1.2% |
a tibble with columns: addr contains x converted to an addr vector, s2 contains the resulting geocoded s2 cells as an s2_cell vector, and match_method is a factor with levels described above
set.seed(1)
cagis_s2 <-
  cagis_addr()$cagis_addr_data |>
  purrr::modify_if(\(.) length(.) > 0 && nrow(.) > 1, dplyr::slice_sample, n = 1) |>
  purrr::map_vec(purrr::pluck, "cagis_s2", .default = NA, .ptype = s2::s2_cell())

addr_match_geocode(x = sample(voter_addresses(), 100), ref_s2 = cagis_s2) |>
  print(n = 100)
Match an addr vector to TIGER street ranges
addr_match_tiger_street_ranges(
  x,
  county = "39061",
  year = "2022",
  street_only_match = c("none", "all", "closest"),
  summarize = c("none", "union", "centroid")
)
x |
an addr vector to match |
county |
character string of county identifier |
year |
year of tigris product |
street_only_match |
for addresses that match a TIGER street name, but have street numbers that don't
intersect with ranges of potential street numbers, return |
summarize |
optionally summarize matched street ranges as their union or centroid |
To best parse street names and types, this function appends dummy address components just for the purposes of matching TIGER street range names (e.g., 1234 {tiger_street_name} Anytown AB 00000)
a list of matched tigris street range tibbles; a NULL value indicates that no street name was matched; if street_only_match is "none", a street range tibble with zero rows indicates that although a street was matched, there was no range containing the street number
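The difference between the street_only_match options can be sketched in base R for a single street. This is an illustration under simplifying assumptions, not the package's logic: the ranges are made up, match_street_number() is a hypothetical helper, and details such as odd/even address parity are ignored.

```r
# Hypothetical street ranges for one matched street name.
ranges <- data.frame(
  TLID = c("a", "b", "c"),
  from = c(100, 200, 3300),
  to   = c(198, 298, 3398)
)

# Hypothetical helper: return ranges containing the street number; if none
# do, fall back according to `street_only_match`.
match_street_number <- function(street_number, ranges,
                                street_only_match = c("none", "all", "closest")) {
  street_only_match <- match.arg(street_only_match)
  lo <- pmin(ranges$from, ranges$to)
  hi <- pmax(ranges$from, ranges$to)
  contains <- street_number >= lo & street_number <= hi
  if (any(contains)) {
    return(ranges[contains, , drop = FALSE])
  }
  switch(street_only_match,
    none = ranges[0, , drop = FALSE], # zero rows: street matched, number did not
    all = ranges,
    closest = {
      gap <- pmin(abs(street_number - lo), abs(street_number - hi))
      ranges[which.min(gap), , drop = FALSE]
    }
  )
}

match_street_number(3333, ranges)                  # contained in range "c"
match_street_number(33333, ranges)                 # zero-row result
match_street_number(33333, ranges, "closest")      # nearest range by number
```

Returning a zero-row table rather than NULL keeps "street matched but number out of range" distinguishable from "no street matched at all".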
my_addr <- as_addr(c("224 Woolper Ave", "3333 Burnet Ave", "33333 Burnet Ave", "609 Walnut St"))

addr_match_tiger_street_ranges(my_addr, county = "39061", street_only_match = "all")

addr_match_tiger_street_ranges(my_addr, county = "39061", summarize = "centroid")

addr_match_tiger_street_ranges(my_addr, county = "39061", street_only_match = "closest", summarize = "centroid") |>
  dplyr::bind_rows() |>
  dplyr::mutate(census_bg_id = s2_join_tiger_bg(s2::as_s2_cell(s2_geography)))
The address components are tagged using a Rust port of usaddress. Component names are based upon the United States Thoroughfare, Landmark, and Postal Address Data Standard.
addr_tag(x, clean_address_text = TRUE)
x |
a character vector of addresses |
clean_address_text |
logical; use clean_address_text() to clean addresses prior to tagging? |
Possible address labels include:
AddressNumberPrefix
AddressNumberSuffix
AddressNumber
BuildingName
CornerOf
IntersectionSeparator
LandmarkName
NotAddress
OccupancyIdentifier
OccupancyType
PlaceName
Recipient
StateName
StreetNamePostDirectional
StreetNamePostType
StreetNamePreDirectional
StreetNamePreModifier
StreetNamePreType
StreetName
SubaddressIdentifier
SubaddressType
USPSBoxGroupID
USPSBoxGroupType
USPSBoxID
USPSBoxType
ZipCode
Find more information about the definitions here
a list, the same length as x, of named character vectors of address component tags; each vector contains all space-separated elements of the cleaned address, and each element is named based on its inferred address label (see Details)
addr_tag(c("290 Ludlow Avenue Apt #2 Cincinnati OH 45220", "3333 Burnet Ave Cincinnati OH 45219"))
CAGIS Addresses
cagis_addr()
An example tibble created from the CAGIS addresses with a pre-calculated, unique cagis_addr vector column.
The cagis_addr_data column is a list of tibbles because one CAGIS address can correspond to multiple parcel identifiers and address-level data (place, type, s2, etc.).
See inst/make_cagis_addr.R for source code to create data, including filtering criteria:
use only addresses that have STATUS of ASSIGNED or USING and are not orphaned (ORPHANFLG == "N")
omit addresses with ADDRTYPEs that are milemarkers (MM), parks (PAR), infrastructure projects (PRJ), cell towers (CTW), vacant or commercial lots (LOT), and other miscellaneous non-residential addresses (MIS, RR, TBA)
s2 cell is derived from LONGITUDE and LATITUDE fields in CAGIS address database
cagis_addr()
remove excess whitespace; keep only letters, numbers, and -
clean_address_text(.x)
.x |
a vector of address character strings |
a vector of cleaned addresses
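The cleaning rule stated above (remove excess whitespace; keep only letters, numbers, and hyphens) can be approximated with a few regular expressions in base R. This is a sketch of the described rule, not the package's source: clean_text_sketch() is a hypothetical name, and the real clean_address_text() may handle additional cases that are not covered here.

```r
# Hypothetical helper approximating the described cleaning rule.
clean_text_sketch <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)  # normalize any whitespace run to one space
  x <- gsub("[^A-Za-z0-9 -]", "", x) # keep only letters, digits, spaces, hyphens
  x <- gsub(" +", " ", x)            # collapse spaces left behind by removals
  trimws(x)
}

clean_text_sketch(c(
  "33_33 Burnet Ave. Cincinnati OH 45219",
  "33\\33 B\"urnet Ave; Ci!ncinn&*ati OH 45219",
  "  3333   Burnet Ave  Cincinnati OH 45219 "
))
```

Whitespace is normalized before other characters are removed so that tabs and newlines act as word separators instead of being deleted and fusing neighboring words.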
clean_address_text(c(
  "3333 Burnet Ave Cincinnati OH 45219",
  "33_33 Burnet Ave. Cincinnati OH 45219",
  "33\\33 B\"urnet Ave; Ci!ncinn&*ati OH 45219",
  "3333 Burnet Ave Cincinnati OH 45219",
  "33_33 Burnet Ave. Cincinnati OH 45219"
))
The Cincinnati Eviction Hotspots data was downloaded from the Eviction Lab and contains characteristics of the top 100 buildings that are responsible for about 25% of all eviction filings in Cincinnati (from their "current through 8-31-2024" release).
elh_data()
https://evictionlab.org/eviction-tracking/cincinnati-oh/
a tibble with 100 rows and 9 columns
elh_data()
Abbreviations of street type (e.g., "Ave", "St") are converted to expanded versions (e.g., "Avenue", "Street").
expand_post_type(x)
x |
character vector of |
a character vector of the same length containing the expanded street name post type
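Expanding street type abbreviations amounts to a lookup from abbreviation to full word. The base R sketch below shows that idea with a tiny, made-up lookup table; expand_sketch() and post_type_lookup are hypothetical, the output casing is an assumption, and the abbreviations known to expand_post_type() are not reproduced here.

```r
# Hypothetical, deliberately tiny lookup table (abbreviation -> expansion).
post_type_lookup <- c(
  ave = "avenue",
  av  = "avenue",
  st  = "street",
  rd  = "road"
)

# Hypothetical helper: expand known abbreviations case-insensitively and
# pass unknown values through (lower-cased) unchanged.
expand_sketch <- function(x) {
  key <- tolower(x)
  expanded <- unname(post_type_lookup[key])
  ifelse(is.na(expanded), key, expanded)
}

expand_sketch(c("Ave", "ST", "Road", "xyz"))
```

Mapping several abbreviations to one expansion is what lets "Burnet Ave" and "Burnet Av" normalize to the same street type before matching.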
expand_post_type(c("ave", "av", "Avenue", "tl"))
get s2_geography for census block groups
get_tiger_block_groups(state, year)
state |
census FIPS state identifier |
year |
vintage of TIGER/Line block group geography files |
a tibble with GEOID and s2_geography columns
get_tiger_block_groups(state = "39", year = "2022")
Downloaded files are cached in tools::R_user_dir("addr", "cache").
Street ranges with missing minimum or maximum address numbers are excluded.
get_tiger_street_ranges(county, year = "2022")
county |
character string of county identifier |
year |
year of tigris product |
a list of tibbles, one for each street name, with TLID, s2_geography, from, and to columns
Sys.setenv("R_USER_CACHE_DIR" = tempfile())
get_tiger_street_ranges("39061")[1001:1004]
Get the identifier of the closest census block group based on the intersection of the s2 cell locations with the US Census TIGER/Line shapefiles
s2_join_tiger_bg(x, year = as.character(2013:2023))
x |
s2_cell vector |
year |
vintage of TIGER/Line block group geography files |
character vector of matched census block group identifiers
s2_join_tiger_bg(x = s2::as_s2_cell(c("8841b39a7c46e25f", "8841a45555555555")), year = "2023")
get s2_geography for census states
tiger_states(year)
year |
vintage of TIGER/Line state geography files |
a tibble with GEOID and s2_geography columns
tiger_states(year = "2022")
Return list of lists of address tags to R.
usaddress_tag(input)
input |
character string of addresses |
The voter_addresses data was generated as an example character vector of real-world addresses.
These addresses were downloaded from the Hamilton County, Ohio voter registration database on 2024-09-12.
See inst/make_example_addresses.R for more details.
AddressPreDirectional, AddressNumber, AddressStreet, AddressSuffix, CityName, "OH", and AddressZip are pasted together to create 242,133 unique addresses of registered voters in Hamilton County, OH.
voter_addresses()
a character vector
voter_addresses() |> head()