| Title: | Clean, Parse, Harmonize, Match, and Geocode Messy Real-World US Addresses |
|---|---|
| Description: | Cleans, parses, standardizes, matches, and geocodes messy, real-world US addresses. Use the included `usaddress` library to tag address components and build addr vector objects composed of addr_part vectors for number, street, and place. These vectors can be standardized, matched, joined, and used as data-frame columns, allowing standard R tools to work with nested address structures. |
| Authors: | Cole Brokamp [aut, cre], Erika Manning [aut] |
| Maintainer: | Cole Brokamp <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.1.0 |
| Built: | 2026-05-19 17:13:48 UTC |
| Source: | https://github.com/geomarker-io/addr |
This wraps the addr fuzzy matching helpers and returns a left-join style result. The addr columns are matched by index and rows are expanded for one-to-many or many-to-many matches.
See addr_match() and addr_left_join() for faster alternatives that
return one selected match instead of all fuzzy matches.
addr_fuzzy_left_join(
  x,
  y,
  by = "addr",
  addr_fields = NULL,
  suffix = c(".x", ".y"),
  progress = interactive()
)
x, y |
data frames or tibbles with an addr column |
by |
addr column name in |
addr_fields |
a named vector of OSA maximum distances. Defaults are used for fields that are not supplied; see Details. |
suffix |
character vector of length 2 used to suffix duplicate columns |
progress |
logical; show progress bar while processing matched ZIP groups? |
addr_fuzzy_left_join() matches addresses within ZIP code groups, so
maximum distances for place fields are ignored.
Defaults for addr_fields:
number_prefix: 0
number_digits: 0
number_suffix: 0
street_predirectional: 0
street_premodifier: 0
street_pretype: 0
street_name: 1
street_posttype: 0
street_postdirectional: 0
a data frame with left-join semantics; note that row order will be changed compared to x
my_addr <- tibble::tibble(
  address = voter_addresses()[1:10],
  addr = as_addr(address),
  id = sprintf("id_%04d", seq_len(10))
)
the_addr <- nad_example_data()
addr_fuzzy_left_join(my_addr, the_addr, c("addr", "nad_addr"))
addr_fuzzy_match() matches two addr vectors using more than one address
field.
fuzzy_match_addr_field() matches two addr vectors using a single address
field.
Distances between address tags are defined using optimized string alignment;
see fuzzy_match() and stringdist::stringdist() for more details.
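To make the distance concrete, here is a minimal base R sketch of optimized string alignment (OSA) distance. It is only an illustration of the metric; `osa_distance()` is a hypothetical helper, not a function from this package, and the package itself points to stringdist::stringdist() for the actual computation.

```r
# Hypothetical helper: OSA distance between two single strings,
# ignoring capitalization. Not part of the addr package.
osa_distance <- function(a, b) {
  a <- strsplit(tolower(a), "")[[1]]
  b <- strsplit(tolower(b), "")[[1]]
  n <- length(a)
  m <- length(b)
  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b
  d <- matrix(0L, n + 1L, m + 1L)
  d[, 1L] <- 0:n
  d[1L, ] <- 0:m
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- as.integer(a[i] != b[j])
      d[i + 1L, j + 1L] <- min(
        d[i, j + 1L] + 1L, # deletion
        d[i + 1L, j] + 1L, # insertion
        d[i, j] + cost     # substitution
      )
      # transposition of two adjacent characters counts as one edit
      if (i > 1L && j > 1L && a[i] == b[j - 1L] && a[i - 1L] == b[j]) {
        d[i + 1L, j + 1L] <- min(d[i + 1L, j + 1L], d[i - 1L, j - 1L] + 1L)
      }
    }
  }
  d[n + 1L, m + 1L]
}

osa_distance("Pinye", "Piney")     # one adjacent transposition
osa_distance("woolper", "woopler") # one adjacent transposition
osa_distance("Oalck", "Oak")       # two deletions
```

Because a swap of neighboring characters costs one edit rather than two, common typing errors in street names stay within a small maximum distance.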
addr_fuzzy_match(x, y, addr_fields = NULL)

fuzzy_match_addr_field(
  x,
  y,
  addr_field = c("number_prefix", "number_digits", "number_suffix",
    "street_predirectional", "street_premodifier", "street_pretype",
    "street_name", "street_posttype", "street_postdirectional",
    "place_name", "place_state", "place_zipcode"),
  osa_max_dist = 0
)
x |
addr vector to match |
y |
addr vector to match to |
addr_fields |
a named vector of OSA maximum distances. Defaults are used for fields that are not supplied; see Details. |
addr_field |
character name of single addr field to match on |
osa_max_dist |
maximum optimized string alignment distance used as threshold for matching on single addr field |
Defaults for addr_fields:
number_prefix: 0
number_digits: 0
number_suffix: 0
street_predirectional: 0
street_premodifier: 0
street_pretype: 0
street_name: 1
street_posttype: 0
street_postdirectional: 0
place_name: 0
place_state: 0
place_zipcode: 0
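The way supplied distances override the defaults listed above can be sketched in base R. `resolve_addr_fields()` is a hypothetical helper written for illustration, and rejecting unknown field names is an assumption rather than documented package behavior.

```r
# Defaults copied from the list above
default_addr_fields <- c(
  number_prefix = 0, number_digits = 0, number_suffix = 0,
  street_predirectional = 0, street_premodifier = 0, street_pretype = 0,
  street_name = 1, street_posttype = 0, street_postdirectional = 0,
  place_name = 0, place_state = 0, place_zipcode = 0
)

# Hypothetical helper: fill in defaults for fields that are not supplied
resolve_addr_fields <- function(addr_fields = NULL) {
  out <- default_addr_fields
  if (is.null(addr_fields)) {
    return(out)
  }
  unknown <- setdiff(names(addr_fields), names(out))
  if (length(unknown) > 0L) {
    # assumption: unknown names are treated as an error in this sketch
    stop("unknown addr fields: ", paste(unknown, collapse = ", "))
  }
  out[names(addr_fields)] <- addr_fields
  out
}

# ignore the address number, keep every other default
resolve_addr_fields(c(number_digits = Inf))
```

A maximum distance of Inf effectively removes a field from consideration, which is how the example for addr_fuzzy_match() ignores the address number.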
When fuzzy matching street_name, the "phonetic_street_key"
prefilter is automatically used (see ?fuzzy_match).
a list of integer vectors representing the position of the best
matching address(es) in y for each address in x
x_addr <- as_addr(c("123 Main St.", "333 Burnet Ave", "3333 Foofy Ave"))
y_addr <- as_addr(c("0000 Main Street", "3333 Burnet Avenue", "222 Burnet Ave"))

# no matches with defaults
addr_fuzzy_match(x_addr, y_addr)

# match on osa_max_dist of 2 for the address number
addr_fuzzy_match(x_addr, y_addr, addr_fields = c("number_digits" = 2))

# ignore address number when matching
addr_fuzzy_match(x_addr, y_addr, addr_fields = c("number_digits" = Inf))

fuzzy_match_addr_field(
  as_addr(c("123 Main St.", "3333 Burnet Ave", "3333 Foofy Ave")),
  as_addr(c("0000 Main Street", "0000 Burnet Avenue", "222 Burnet Ave")),
  addr_field = "street_name",
  osa_max_dist = 1
)

# empty address fields have an OSA distance of zero and always match
fuzzy_match_addr_field(
  as_addr(c("123 Main St.", "3333 Burnet Ave", "3333 Foofy Ave")),
  as_addr(c("0000 Main Street", "0000 Burnet Avenue", "222 Burnet Ave")),
  addr_field = "number_prefix"
)
addr_left_join() is a convenience wrapper around addr_match() that
returns a left-join style result. It expands rows of x for duplicate rows
in the original y that share the exact matched addr, but it does not
return multiple distinct candidate addresses from y. addr_match() still
selects a single best address before this wrapper expands exact duplicates.
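The row expansion described here behaves like an ordinary left join on the single selected address. The base R sketch below uses plain character keys and made-up column names (`id`, `matched`, `parcel`) in place of addr vectors; it illustrates the join semantics only and is not the package's implementation.

```r
# x: one selected match per input row (NA when nothing matched)
x <- data.frame(
  id = c("a", "b", "c"),
  matched = c("10 MAIN ST 45220", "11 MAIN ST 45220", NA),
  stringsAsFactors = FALSE
)

# y: reference rows, where the first address appears twice
y <- data.frame(
  matched = c("10 MAIN ST 45220", "10 MAIN ST 45220", "11 MAIN ST 45220"),
  parcel = c("p1", "p2", "p3"),
  stringsAsFactors = FALSE
)

# left join: exact duplicates in y expand the row for "a";
# the unmatched row "c" is kept with missing y columns
joined <- merge(x, y, by = "matched", all.x = TRUE, sort = FALSE)
joined
```

Row "a" appears twice because two reference rows share its matched address, while no row ever receives two different candidate addresses.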
addr_left_join(
  x,
  y,
  by = "addr",
  suffix = c(".x", ".y"),
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap"),
  name_phonetic_dist = 2L,
  name_fuzzy_dist = 1L,
  number_fuzzy_dist = 1L,
  match_street_type = c("exact", "compatible", "ignore"),
  match_street_directional = c("exact", "swap", "ignore"),
  progress = interactive(),
  match_prepared = NULL
)
x, y |
data frames or tibbles with an addr column |
by |
addr column name in |
suffix |
character vector of length 2 used to suffix duplicate columns |
zip_variants |
logical; fuzzy match to common variants of
|
zip_variant |
character vector; zipcode variant types to use when
|
name_phonetic_dist |
integer; maximum optimized string alignment
distance between |
name_fuzzy_dist |
integer; maximum optimized string alignment distance
between |
number_fuzzy_dist |
integer; maximum optimized string alignment
distance between |
match_street_type |
character; how to compare street pretype and
posttype when selecting street candidates. |
match_street_directional |
character; how to compare street
predirectional and postdirectional when selecting street candidates.
|
progress |
logical; show |
match_prepared |
optional prepared |
A data frame with left-join semantics. Duplicate rows in y with
the exact same matched addr are all returned. Partial ZIP-only or
street-only matches do not expand to multiple candidate rows in y.
the_addr <- nad("Hamilton", "OH", refresh_binary = "no", refresh_source = "no")
my_addr <- tibble::tibble(
  addr = as_addr(voter_addresses()[1:100]),
  id = 1:100
)
d <- addr_left_join(
  my_addr,
  the_addr,
  by = c("addr", "nad_addr"),
  match_prepared = nad_example_data(match_prepared = TRUE)
)
d

# some addresses may match with more than one address in NAD
# since matching does not consider subaddress (e.g. "line two")
# take the first row in these cases
table(addr_match_stage(d$nad_addr.y[!duplicated(d$id)]))
A single addr in y is chosen for each addr in x. Matching is staged to
reduce the search space: ZIP codes are matched first, street names are then
matched within each matched ZIP code, and street numbers are finally matched
within each matched street and ZIP code combination. If more than one
candidate addr remains in y after these stages, the first candidate in y
is returned.
Missing or empty address components that cannot be matched at any stage are
left missing in the returned addr() values. Rows with a matched ZIP code
but no street match return an addr with only @place@zipcode filled; rows
with matched ZIP code and street but no number match also return the matched
@street.
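The staged search and its partial results can be sketched with exact comparisons in base R. This toy version skips ZIP variants and fuzzy or phonetic matching; `staged_match()`, `match_stage()`, and the column names are hypothetical and only mirror the ZIP, then street, then number order described above.

```r
# Toy reference table with hypothetical column names
ref <- data.frame(
  number = c("10", "11", "10"),
  street = c("MAIN ST", "MAIN ST", "MAIN ST"),
  zipcode = c("45220", "45220", "45229"),
  stringsAsFactors = FALSE
)

# Narrow the candidates stage by stage; stop at the first stage that fails
staged_match <- function(number, street, zipcode, ref) {
  out <- list(
    number = NA_character_,
    street = NA_character_,
    zipcode = NA_character_
  )
  in_zip <- ref[which(ref$zipcode == zipcode), , drop = FALSE]
  if (nrow(in_zip) == 0L) return(out)
  out$zipcode <- zipcode
  on_street <- in_zip[which(in_zip$street == street), , drop = FALSE]
  if (nrow(on_street) == 0L) return(out)
  out$street <- street
  at_number <- on_street[which(on_street$number == number), , drop = FALSE]
  if (nrow(at_number) == 0L) return(out)
  out$number <- at_number$number[1L] # first remaining candidate
  out
}

# Classify a result by the last stage that was filled in
match_stage <- function(m) {
  stage <- if (is.na(m$zipcode)) {
    "none"
  } else if (is.na(m$street)) {
    "zip"
  } else if (is.na(m$number)) {
    "street"
  } else {
    "number"
  }
  factor(stage, levels = c("none", "zip", "street", "number"), ordered = TRUE)
}

stages <- c(
  as.character(match_stage(staged_match("99", "MAIN ST", "45220", ref))),
  as.character(match_stage(staged_match("10", "OAK ST", "45220", ref))),
  as.character(match_stage(staged_match("10", "MAIN ST", "45103", ref))),
  as.character(match_stage(staged_match("10", "MAIN ST", "45229", ref)))
)
stages
```

Each later stage only searches the rows that survived the previous one, which is why a failed street match leaves just the ZIP code filled.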
addr_match() accepts raw reference data and prepares it internally, which
is the right default for one-off matching jobs. addr_match_prepare()
becomes useful when the same reference y will be reused across multiple
calls to addr_match(), because it caches the deduplicated reference
addresses and ZIP/street/number candidate lookups once instead of rebuilding
them on every call.
Preparing y once avoids recomputing unique(y), ZIP-code groups, and
exact street/number candidate lookups each time you call addr_match()
with the same reference addresses. For a single end-to-end match, preparing
y explicitly does not remove that work; it only moves it outside
addr_match().
addr_match(
  x,
  y,
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap"),
  name_phonetic_dist = 2L,
  name_fuzzy_dist = 1L,
  number_fuzzy_dist = 1L,
  match_street_type = c("exact", "compatible", "ignore"),
  match_street_directional = c("exact", "swap", "ignore"),
  progress = interactive()
)

addr_match_prepare(y, progress = interactive())
x |
addr vector to match |
y |
addr vector to match against, or a prepared |
zip_variants |
logical; fuzzy match to common variants of
|
zip_variant |
character vector; zipcode variant types to use when
|
name_phonetic_dist |
integer; maximum optimized string alignment
distance between |
name_fuzzy_dist |
integer; maximum optimized string alignment distance
between |
number_fuzzy_dist |
integer; maximum optimized string alignment
distance between |
match_street_type |
character; how to compare street pretype and
posttype when selecting street candidates. |
match_street_directional |
character; how to compare street
predirectional and postdirectional when selecting street candidates.
|
progress |
logical; show reference-preparation timing and a progress
bar while preparing raw |
An addr vector, the same length as x, containing the selected
match in y for each addr in x. Partial matches are returned with
matched ZIP code and/or street fields filled when later stages do not match.
the_addr <- nad_example_data(match_prepared = TRUE)
my_addr <- as_addr(
  c(
    "2700 Alice St 45222",
    "10623 Srpingfield Pike 45215",
    "173 Wuhlper Ave 45220",
    "12176 8th Ave 45249",
    "12176 7ht Ave 45249",
    "10 W 14th St 45202",
    "10 Oak Rd 45241"
  )
)

addr_match(my_addr, the_addr)

addr_match(
  my_addr,
  the_addr,
  zip_variants = FALSE,
  name_phonetic_dist = 0L,
  name_fuzzy_dist = 0L,
  number_fuzzy_dist = 0L,
  match_street_type = "ignore",
  match_street_directional = "ignore"
)

my_addr <- as_addr(voter_addresses()[1:100])
d <- addr_match(my_addr, the_addr)
d
addr_match_stage(d)
Classify an addr vector into the staged outcomes returned by
addr_match(): no match, ZIP-only match, ZIP-plus-street match, or
ZIP-plus-street-plus-number match.
addr_match_stage(x, strict = TRUE)
x |
addr vector to classify |
strict |
logical; require |
an ordered factor with levels none, zip, street, number
y <- as_addr(c(
  "10 MAIN ST CINCINNATI OH 45220",
  "11 MAIN ST CINCINNATI OH 45220",
  "10 MAIN ST CINCINNATI OH 45229"
))
x <- as_addr(c(
  "99 MAIN ST CINCINNATI OH 45220",
  "10 OAK ST CINCINNATI OH 45220",
  "10 MAIN ST CINCINNATI OH 45103"
))
out <- addr_match(x, y)
addr_match_stage(out)
The structures for addr() and the addr_ classes are
derived as a subset of the United States Thoroughfare, Landmark, and Postal
Address Data Standard that is relevant for residential, numbered thoroughfare
addresses:
Address
├─ AddressNumber
│  ├─ AddressNumberPrefix
│  ├─ AddressNumber
│  └─ AddressNumberSuffix
├─ StreetName
│  ├─ StreetNamePreModifier
│  ├─ StreetNamePreDirectional
│  ├─ StreetNamePreType
│  ├─ StreetName
│  ├─ StreetNamePostType
│  └─ StreetNamePostDirectional
└─ Place
   ├─ PlaceName
   ├─ StateName
   └─ ZipCode
addr() combines addr_number(), addr_street(), and addr_place() into a
single addr vector:
<addr>
 @ number: <addr_number>
 .. @ prefix
 .. @ digits
 .. @ suffix
 @ street: <addr_street>
 .. @ predirectional
 .. @ premodifier
 .. @ pretype
 .. @ name
 .. @ posttype
 .. @ postdirectional
 @ place : <addr_place>
 .. @ name
 .. @ state
 .. @ zipcode
addr_number(
  prefix = NA_character_,
  digits = NA_character_,
  suffix = NA_character_
)

addr_street(
  predirectional = NA_character_,
  premodifier = NA_character_,
  pretype = NA_character_,
  name = NA_character_,
  posttype = NA_character_,
  postdirectional = NA_character_,
  map_posttype = TRUE,
  map_directional = TRUE,
  map_pretype = TRUE,
  map_ordinal = TRUE
)

addr_place(
  name = NA_character_,
  state = NA_character_,
  zipcode = NA_character_,
  map_state = TRUE
)

addr(number = addr_number(), street = addr_street(), place = addr_place())
prefix |
address number prefix, often a fractional or grid component |
digits |
primary street number for the address; must be between 0 and 999999 |
suffix |
address number suffix, often a letter or unit-like component |
predirectional |
direction before the street name |
premodifier |
descriptive modifier before the street name |
pretype |
street type or classification before the street name |
name |
street name, or city/town/municipality name for |
posttype |
street type or classification after the street name |
postdirectional |
direction after the street name |
map_posttype |
logical; map posttype to abbreviations? |
map_directional |
logical; map pre- and post-directional to abbreviations? |
map_pretype |
logical; map pretype to abbreviations? |
map_ordinal |
logical; map ordinal street names to abbreviations? |
state |
state or territory abbreviation |
zipcode |
ZIP code (must be five digits not starting with "000") |
map_state |
logical; map state to abbreviations? |
number |
an addr_number vector |
street |
an addr_street vector |
place |
an addr_place vector |
All field values must be character vectors of at least length one (including missing values). Length-one fields are recycled to match the length of other fields.
An addr, addr_number, addr_street, or addr_place vector
# define a new addr_number vector
addr_number(digits = "290")
addr_number(prefix = "N", digits = "290", suffix = "A")

# define a new addr_street vector
addr_street(name = "Burnet", posttype = "Ave")

# street names are automatically mapped to abbreviations
addr_street(predirectional = "North", name = "Fifth", posttype = "Street")

# define a new addr_place vector
addr_place(name = "Cincinnati", state = "OH", zipcode = "45220")

# define a new addr vector
addr(
  addr_number(digits = "290"),
  addr_street(name = "Burnet", posttype = "Ave"),
  addr_place(name = "Cincinnati", state = "OH", zipcode = "45229")
)

# define a more complicated addr vector
# and explicitly specify empty components to avoid NA
addr(
  addr_number(prefix = "", digits = "200", suffix = ""),
  addr_street(
    predirectional = "west",
    premodifier = "Old",
    pretype = "US",
    name = "50",
    posttype = "avenue",
    postdirectional = "east",
    map_directional = TRUE,
    map_pretype = TRUE,
    map_posttype = TRUE
  ),
  addr_place(name = "Cincinnati", state = "ohio", zipcode = "45220")
)

# addr_* vectors are recycled and omitted fields are missing
addr(
  addr_number(digits = c("290", "200", "3333", "111")),
  addr_street(
    name = c("Burnet", "Main", "Ludlow", "State Route 32"),
    posttype = c("Ave", "St", "Ave", NA_character_)
  ),
  addr_place(name = "Cincinnati", state = "OH")
)
as_addr() converts other objects into addr() vectors.
See ?addr for more details on its structure.
as_addr(x, ...)
x |
object to coerce to an addr vector |
... |
additional arguments passed to methods |
character: will be cleaned (if clean = TRUE) with clean_address_text()
and then tagged using usaddress_tag(); tags are normalized to abbreviations
by passing all map_* arguments to addr_street() or addr_place();
ZIP codes parsed with more than five characters are truncated
with a warning, and malformed parsed ZIP codes are set to missing with a
warning; non-numeric characters in parsed
address number digits will be removed with a warning; parsed address number
digits greater than 999999 are truncated to the first six digits with a
warning
data.frame: must have columns named according to fields in
addr_number(), addr_street(), or addr_place(); also passes
the map_* arguments to addr_street() and addr_place()
addr: returned as-is
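The ZIP code and address number rules for character input can be approximated in base R as follows. `normalize_zip()` and `normalize_digits()` are hypothetical helpers, the warning texts are invented, and using a length check for "greater than 999999" is an assumption; the sketch only mirrors the rules stated above, not the package's code.

```r
# Hypothetical helper: truncate long ZIP codes, then drop malformed ones
normalize_zip <- function(zip) {
  too_long <- !is.na(zip) & nchar(zip) > 5L
  if (any(too_long)) {
    warning("ZIP codes with more than five characters were truncated")
    zip[too_long] <- substr(zip[too_long], 1L, 5L)
  }
  # valid ZIP codes are five digits and do not start with "000"
  bad <- !is.na(zip) & (!grepl("^[0-9]{5}$", zip) | startsWith(zip, "000"))
  if (any(bad)) {
    warning("malformed ZIP codes were set to missing")
    zip[bad] <- NA_character_
  }
  zip
}

# Hypothetical helper: keep digits only, then keep at most six of them
normalize_digits <- function(digits) {
  has_other <- !is.na(digits) & grepl("[^0-9]", digits)
  if (any(has_other)) {
    warning("non-numeric characters were removed from address numbers")
    digits[has_other] <- gsub("[^0-9]", "", digits[has_other])
  }
  too_big <- !is.na(digits) & nchar(digits) > 6L
  if (any(too_big)) {
    warning("address numbers were truncated to the first six digits")
    digits[too_big] <- substr(digits[too_big], 1L, 6L)
  }
  digits
}

suppressWarnings(normalize_zip(c("45220-1234", "4522", "00012", NA)))
suppressWarnings(normalize_digits(c("12A", "1234567", "290")))
```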
as_addr(voter_addresses()[1:1000])

data.frame(
  number_digits = c("290", "200"),
  street_name = c("Burnet", "Main"),
  street_posttype = c("Ave", "St"),
  place_name = c("Cincinnati", "Cincinnati"),
  place_state = c("OH", "OH"),
  place_zipcode = c("45229", "45220"),
  stringsAsFactors = FALSE
) |>
  as_addr()
Remove excess whitespace and keep only letters, numbers, #, and -.
clean_address_text(x)
x |
a vector of address character strings |
a vector of cleaned addresses
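As a rough base R illustration of this kind of cleaning, the sketch below keeps letters, numbers, `#`, `-`, and single spaces. `clean_text_sketch()` is a hypothetical stand-in, so its handling of edge cases may differ from clean_address_text().

```r
# Hypothetical stand-in for the cleaning step, using base R regular expressions
clean_text_sketch <- function(x) {
  x <- gsub("[[:space:]]+", " ", x)   # tabs and newlines become spaces
  x <- gsub("[^A-Za-z0-9# -]", "", x) # drop every other character
  x <- gsub(" +", " ", x)             # collapse runs of spaces
  trimws(x)
}

clean_text_sketch(c(
  "  3333   Burnet Ave Cincinnati OH 45219 ",
  "33\\33 B\"urnet Ave; Ci!ncinn&*ati OH 45219"
))
```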
clean_address_text(c(
  "3333 Burnet Ave Cincinnati OH 45219",
  "33_33 Burnet Ave. Cincinnati OH 45219",
  "33\\33 B\"urnet Ave; Ci!ncinn&*ati OH 45219",
  "3333 Burnet Ave Cincinnati OH 45219",
  "33_33 Burnet Ave. Cincinnati OH 45219"
))
county_fips_lookup() uses a package-internal reference derived from the
2025 U.S. Census county adjacency file to translate between county names,
state abbreviations, and 5-digit county FIPS identifiers.
Name lookups accept either the full county-equivalent label
(for example, "Orleans Parish") or a shortened form with common suffixes
removed (for example, "Orleans"). If a shortened form is ambiguous within
a state, the function errors and asks for the full county-equivalent name
or the 5-digit FIPS identifier.
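The name-resolution rules can be sketched with a four-row lookup table in base R. `lookup_county_sketch()` and the table are illustrative only: the suffix list is a small assumed subset, requiring `state` for name lookups is an assumption, and the real reference covers every county equivalent.

```r
# Small illustrative subset of county equivalents
counties <- data.frame(
  county_full = c("Hamilton County", "Orleans Parish",
                  "Baltimore County", "Baltimore city"),
  state = c("OH", "LA", "MD", "MD"),
  county_fips = c("39061", "22071", "24005", "24510"),
  stringsAsFactors = FALSE
)
# shortened form with a few common suffixes removed
counties$county <- sub(" (County|Parish|city)$", "", counties$county_full)

lookup_county_sketch <- function(county, state = NULL) {
  if (grepl("^[0-9]{5}$", county)) {
    hit <- counties[counties$county_fips == county, , drop = FALSE]
  } else {
    # assumption: a state abbreviation is needed for name lookups
    if (is.null(state)) stop("`state` is needed to look up a county name")
    in_state <- counties[counties$state == state, , drop = FALSE]
    # prefer the full label, then fall back to the shortened form
    hit <- in_state[tolower(in_state$county_full) == tolower(county), , drop = FALSE]
    if (nrow(hit) == 0L) {
      hit <- in_state[tolower(in_state$county) == tolower(county), , drop = FALSE]
    }
    if (nrow(hit) > 1L) {
      stop("ambiguous name; use the full county-equivalent name or the FIPS")
    }
  }
  if (nrow(hit) == 0L) stop("no matching county found")
  hit
}

lookup_county_sketch("Hamilton", "OH")
lookup_county_sketch("39061")
lookup_county_sketch("Baltimore County", "MD")
# lookup_county_sketch("Baltimore", "MD") would stop with an ambiguity error
```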
county_fips_lookup(county, state = NULL)
county |
character, length one; either a county name or a 5-digit county FIPS identifier |
state |
character, length one; state abbreviation or full state name;
required when |
A tibble with one row and columns county, county_full, state,
and county_fips.
county_fips_lookup("Hamilton", "OH")
county_fips_lookup("Hamilton", "Ohio")
county_fips_lookup("39061")
The Cincinnati Eviction Hotspots data was downloaded from the Eviction Lab and contains characteristics of the top 100 buildings that are responsible for about 25% of all eviction filings in Cincinnati (from their "current through 8-31-2024" release).
elh_data()
https://evictionlab.org/eviction-tracking/cincinnati-oh/
a tibble with 100 rows and 9 columns
elh_data()
Fuzzy match strings in x to strings in y using optimized string
alignment (OSA) distance and ignoring capitalization.
fuzzy_match(x, y, osa_max_dist = 1, prefilter = c("none", "psk"))
x |
character vector to match |
y |
character vector to match to |
osa_max_dist |
maximum OSA distance to consider a match;
|
prefilter |
method used to prefilter |
If multiple strings in y are tied for the minimum OSA distance from a
string in x, all of their indices are included in the return value.
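The tie-handling rule can be illustrated in base R. Note the substitution: this sketch uses utils::adist(), which computes generalized Levenshtein distance and therefore counts a transposition as two edits, so it is not OSA. `fuzzy_match_sketch()` is hypothetical, and returning an empty integer vector for unmatched strings is an assumption.

```r
# Hypothetical sketch: return every index in y tied for the minimum distance
fuzzy_match_sketch <- function(x, y, max_dist = 1) {
  # Levenshtein distance matrix (rows: x, columns: y), ignoring case
  d <- utils::adist(x, y, ignore.case = TRUE)
  lapply(seq_along(x), function(i) {
    best <- min(d[i, ])
    if (best > max_dist) {
      integer(0) # assumption: no match is represented as an empty vector
    } else {
      unname(which(d[i, ] == best))
    }
  })
}

the_names <- c("Pine", "Pint", "Oak", "Greenfield")
matches <- fuzzy_match_sketch(c("Pin", "Greenfild", "Sunset"), the_names)
matches
lapply(matches, \(i) the_names[i])
```

Here "Pin" is one edit away from both "Pine" and "Pint", so both positions are returned for it.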
a list of integer vectors representing the position of the best
matching string(s) in y for each string in x
my_names <- c("Pinye", "Pine", "Oalck", "Sunset", "Riverbend", "Greenfild")
the_names <- c("Piney", "Pine", "Oak", "Cheshire", "Greenfield", "Maple", "Elm")

matches <- fuzzy_match(my_names, the_names, osa_max_dist = 1)
matches
lapply(matches, \(i) the_names[i])

x <- as_addr(voter_addresses()[1:100])@street@name
y <- unique(nad_example_data()$nad_addr@street@name)
system.time(fuzzy_match(x, y))

# larger vectors see a speedup when using
# phonetic_street_key as a prefilter
# but may miss potential matches that are within
# osa_max_dist of each other, but did not have
# identical phonetic codes (e.g., "woolper" and "woopler")
system.time(fuzzy_match(x, y, prefilter = "psk"))
geocode() geocodes addr vectors using Census TIGER address
features (see ?taf) by:
searching for a matching street (see ?match_addr_street) within the same ZIP code, and within similar ZIP codes if necessary
using the address number to select the best address feature range and side of the street (even/odd), breaking ties on smallest width and spread
linearly interpolating a geographic point along the best range line based on the actual and potential range of address numbers
offsetting the interpolated point from the range line perpendicularly
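The interpolation and offset steps above can be sketched with planar coordinates in base R. This is a simplification under stated assumptions: coordinates are treated as meters on a flat plane, the range line is a single straight segment, and `interpolate_on_range()` with its `side` argument is hypothetical. The package works with s2 geography instead.

```r
# Hypothetical helper: place an address number along a straight range line
# p0, p1: c(x, y) endpoints in meters; from, to: address numbers at p0 and p1
interpolate_on_range <- function(number, from, to, p0, p1,
                                 offset = 10, side = c("left", "right")) {
  side <- match.arg(side)
  # fraction of the way along the range; use the midpoint for a one-number range
  frac <- if (to == from) 0.5 else (number - from) / (to - from)
  point <- p0 + frac * (p1 - p0)
  # unit vector along the line and its perpendicular (left of travel direction)
  direction <- (p1 - p0) / sqrt(sum((p1 - p0)^2))
  normal <- c(-direction[2], direction[1])
  if (side == "right") normal <- -normal
  point + offset * normal
}

# number 150 on a 100-200 range running 100 m east from the origin
interpolate_on_range(150, 100, 200, p0 = c(0, 0), p1 = c(100, 0))
interpolate_on_range(150, 100, 200, p0 = c(0, 0), p1 = c(100, 0), side = "right")
```

The first call lands halfway along the segment and 10 m to its left; the second lands the same distance to its right.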
Only matched input addresses return non-missing matched ZIP code and street
values. Missing or unmatched ZIP codes return missing matched ZIP code,
street, geography, and s2 cell values. If all ranges on the matched ZIP code
and street exclude the address number, only the geography and s2 cell values
return NA.
geocode(
  x,
  name_phonetic_dist = 1L,
  name_fuzzy_dist = 2L,
  match_street_type = c("exact", "compatible", "ignore"),
  match_street_directional = c("exact", "swap", "ignore"),
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap"),
  year = as.character(2025:2011),
  version = "v1",
  taf_install = TRUE,
  taf_redownload = FALSE,
  offset = 10L,
  progress = interactive()
)

geocode_zip(
  x,
  offset = 10L,
  name_phonetic_dist = 1L,
  name_fuzzy_dist = 2L,
  match_street_type = c("exact", "compatible", "ignore"),
  match_street_directional = c("exact", "swap", "ignore"),
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap"),
  year = as.character(2025:2011),
  version = "v1",
  taf_install = TRUE,
  taf_redownload = FALSE,
  progress_callback = NULL,
  taf_check = TRUE
)
x |
an addr vector ( |
name_phonetic_dist |
integer; maximum optimized string alignment
distance between |
name_fuzzy_dist |
integer; maximum optimized string alignment distance
between |
match_street_type |
character; how to compare street pretype and
posttype when selecting street candidates. |
match_street_directional |
character; how to compare street
predirectional and postdirectional when selecting street candidates.
|
zip_variants |
logical; fuzzy match to common variants of
|
zip_variant |
character vector; zipcode variant types to use when
|
year |
integer, length one; vintage of TIGER addrfeat (address feature) files |
version |
character, length one; major version of the package and taf dataset schema |
taf_install |
logical; install missing county TAF files needed for
input ZIP codes and selected ZIP code variants before geocoding? If
|
taf_redownload |
logical; re-download cached TIGER ZIP files when installing missing TAF counties? |
offset |
number of meters to offset geocode from street line |
progress |
logical; show a ZIP-code progress bar while geocoding? |
progress_callback |
optional callback used internally by |
taf_check |
logical; check for missing TAF counties? Used internally
by |
geocode_zip() is the workhorse function and operates on addr vectors
with the same ZIP code; use geocode() to geocode an addr vector
with multiple ZIP codes by grouping them by ZIP code and processing
serially by default.
At a lower level, grouping addr vectors by ZIP code and applying
geocode_zip() facilitates more control (e.g., parallel processing).
If the mirai package is installed and mirai daemons have already been
configured by the caller, geocode() uses them for ZIP-code-level
parallel processing. Otherwise it falls back to sequential processing.
geocode() and geocode_zip() both download and install TIGER address
features by county (?taf_install) as needed based on the input addr ZIP
codes (and possibly ZIP code variants). TAF install checks run before
reading TAF ZIP files so parallel geocoding workers do not try to download
county files at the same time.
A tibble with columns addr (the input addr vector),
matched_zipcode (character vector), matched_street (addr_street
vector), matched_geography (s2_geography point vector), and s2_cell
(s2_cell vector).
x <- as_addr(voter_addresses()[1:100])

# for example purposes, only install one county
Sys.setenv("R_USER_DATA_DIR" = tempfile())
taf_install("39061", "2025")

# and geocode without installing other counties
gcd <- geocode(x, taf_install = FALSE)

# this is only for example purposes and usually not required; e.g.
## Not run:
gcd <- geocode(x)
## End(Not run)

gcd
table(geocode_stage(gcd))
geocode_table(gcd)

leaflet::leaflet(wk::wk_coords(gcd$matched_geography)) |>
  leaflet::addTiles() |>
  leaflet::addCircleMarkers(lng = ~x, lat = ~y, label = ~feature_id)

# use mirai for parallel processing
## Not run:
mirai::daemons(2)
geocode(x)
mirai::daemons(0)
## End(Not run)
Classify geocode results into staged outcomes returned by geocode():
no match, street match, or interpolated range match, distinguishing exact
ZIP-code matches from ZIP-code variant matches.
geocode_stage(x)
x |
a data frame returned by |
an ordered factor with levels none, street_variant, street,
range_variant, range
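Because the result is an ordered factor, stages can be compared directly. The sketch below uses only base R and a hand-written toy vector of stage labels (not real geocode() output) to show how the documented level order could be used to filter results.

```r
# Level order as documented for geocode_stage(), from weakest to strongest match
stage_levels <- c("none", "street_variant", "street", "range_variant", "range")

# Toy stage labels standing in for geocode_stage() output (illustrative only)
stages <- factor(
  c("range", "none", "street_variant", "range_variant", "street"),
  levels = stage_levels,
  ordered = TRUE
)

# Ordered factors support comparison against a level name
at_least_street <- stages >= "street"
at_least_street

# Counts per stage, in level order
table(stages)
```

Comparing against a level name keeps filtering code readable and independent of the integer codes behind the factor.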
geocode_table() converts the rich output from geocode() to a flat table
with only JSON-safe column types.
geocode_table(x)
x |
a data frame returned by |
A tibble with atomic columns suitable for JSON serialization.
geocode_table() includes the input address, geocode stage, matched ZIP
code, matched street, and S2 cell as character columns.
A single addr_number in y is chosen for each addr_number in x.
If exact matches (using as.character) are not found,
possible matches within number_fuzzy_dist are searched for in y.
If multiple matches are present in y, the selected match has the
lowest absolute numeric difference from @digits in x; ties are broken
by optimized string alignment (OSA) distance and then by lexicographic
order with digits preceding alphabetic characters.
addr_number objects with missing @digits, or with empty strings
for all of @prefix, @digits, and @suffix, are not matched and are
returned as missing instead.
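The selection rule described above can be sketched in base R on plain digit strings. This is not the package implementation: the helper name pick_number() is hypothetical, it ignores prefixes and suffixes, and it uses utils::adist() (generalized Levenshtein distance) as a stand-in for OSA distance, which differs when adjacent characters are transposed.

```r
# Hypothetical helper: choose one candidate from y_digits for a single x_digits
pick_number <- function(x_digits, y_digits, max_dist = 1L) {
  # missing or empty digits are never matched
  if (is.na(x_digits) || !nzchar(x_digits)) return(NA_character_)
  # exact matches win outright
  if (x_digits %in% y_digits) return(x_digits)
  # edit distance to every candidate (stand-in for OSA distance)
  d <- as.vector(utils::adist(x_digits, y_digits))
  keep <- which(d <= max_dist)
  if (length(keep) == 0L) return(NA_character_)
  cand <- y_digits[keep]
  # rank by absolute numeric difference, then distance, then C-locale order
  num_diff <- abs(as.numeric(cand) - as.numeric(x_digits))
  ord <- order(num_diff, d[keep], cand, method = "radix")
  cand[ord[1L]]
}

y <- c("12", "11", "10", "22")
pick_number("1", y)     # "10": 12, 11, 10 are all within distance 1; 10 is numerically closest
pick_number("228", y)   # "22": the only candidate within distance 1
pick_number("99897", y) # NA: nothing within distance 1
pick_number("1", y, max_dist = 0L) # NA: exact matching only
```

Sorting on several keys at once with order() keeps the tie-breaking sequence explicit; method = "radix" sorts character keys in the C locale, where digits precede letters.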
match_addr_number(x, y, number_fuzzy_dist = 1L)
x, y
|
addr_number vectors to match |
number_fuzzy_dist |
integer; maximum optimized string alignment
distance between |
An addr_number vector, the same length as x, containing the
selected match in y for each element of x. Unmatched elements are
returned as missing addr_number() values.
x <- addr_number(
  prefix = "",
  digits = as.character(c(1, 10, 228, 11, 22, 22, 22, 10, 99897, NA)),
  suffix = ""
)
y <- addr_number(
  prefix = "",
  digits = as.character(c(12, 11, 10, 22)),
  suffix = ""
)
match_addr_number(x, y)
match_addr_number(x, y, number_fuzzy_dist = 0L)
A single addr_street in y is chosen for each addr_street in x.
If exact matches (using as.character) are not found,
candidate matches are chosen by
fuzzy matching on street name (using phonetic street key and street name)
and matching the street type and directional components according to
match_street_type and match_street_directional.
Ordinal street names use restricted phonetic candidates:
an ordinal phonetic key like #0007 may fuzzy match only to plausible
ordinal neighbors such as digit shifts (#0070, #0700, #7000)
or same-width substitutions (#0008, #0009), not arbitrary
OSA-distance-one ordinal keys such as #0017 or #0077.
If multiple candidates remain after fuzzy matching, the first candidate in
y is returned.
addr_street objects with missing or empty @name are not matched and are
returned as missing instead.
match_addr_street(
  x,
  y,
  name_phonetic_dist = 1L,
  name_fuzzy_dist = 2L,
  match_street_type = c("exact", "compatible", "ignore"),
  match_street_directional = c("exact", "swap", "ignore")
)
x, y
|
addr_street vectors to match |
name_phonetic_dist |
integer; maximum optimized string alignment
distance between |
name_fuzzy_dist |
integer; maximum optimized string alignment distance
between |
match_street_type |
character; how to compare street pretype and
posttype when selecting street candidates. |
match_street_directional |
character; how to compare street
predirectional and postdirectional when selecting street candidates.
|
An addr_street vector, the same length as x, containing the
selected match in y for each element of x. Unmatched elements are
returned as missing addr_street() values.
my_streets <- addr_street(
  predirectional = "",
  premodifier = "",
  pretype = "",
  name = c("Beechview", "Vivian", "Springfield", "Round Bottom", "Pfeiffer", "Beachview", "Vevan", "Srpingfield", "Square Top", "Pfeffer", "Wuhlper", ""),
  posttype = c("Cir", "Pl", "Pike", "Rd", "Rd", "Cir", "Pl", "Pike", "Rd", "Rd", "Ave", ""),
  postdirectional = ""
)
the_streets <- nad_example_data()$nad_addr@street
match_addr_street(my_streets, the_streets)
toggle_y <- addr_street(
  predirectional = c("E", "", "", "E"),
  premodifier = "",
  pretype = c("", "", "US Hwy", "US Hwy"),
  name = c("14th", "Oak", "Main", "Main"),
  posttype = c("St", "Rd", "Rd", "Rd"),
  postdirectional = c("", "", "", "E"),
  map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
)
# directionals are required by default, so blank "14th St" stays unmatched
format(match_addr_street(
  addr_street(
    predirectional = "", premodifier = "", pretype = "",
    name = "14th", posttype = "St", postdirectional = "",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y
))
format(match_addr_street(
  addr_street(
    predirectional = "", premodifier = "", pretype = "",
    name = "14th", posttype = "St", postdirectional = "",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y,
  match_street_directional = "ignore"
))
# type can also be ignored during fuzzy street-name matching
format(match_addr_street(
  addr_street(
    predirectional = "", premodifier = "", pretype = "",
    name = "Oka", posttype = "Ave", postdirectional = "",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y,
  name_fuzzy_dist = 1L
))
format(match_addr_street(
  addr_street(
    predirectional = "", premodifier = "", pretype = "",
    name = "Oka", posttype = "Ave", postdirectional = "",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y,
  name_fuzzy_dist = 1L,
  match_street_type = "ignore"
))
# compatible type matching allows blanks to stand in for unknown type fields
type_y <- addr_street(
  predirectional = "", premodifier = "", pretype = c("Ave", "Rd"),
  name = "Main", posttype = "", postdirectional = "",
  map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
)
format(match_addr_street(
  addr_street(
    predirectional = "", premodifier = "", pretype = "",
    name = "Main", posttype = "Ave", postdirectional = "",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  type_y,
  match_street_type = "compatible"
))
# type and directional matching can be relaxed independently
format(match_addr_street(
  addr_street(
    predirectional = "E", premodifier = "", pretype = "US Hwy",
    name = "Mian", posttype = "Rd", postdirectional = "E",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y,
  match_street_type = "ignore",
  name_fuzzy_dist = 1L
))
format(match_addr_street(
  addr_street(
    predirectional = "E", premodifier = "", pretype = "US Hwy",
    name = "Mian", posttype = "Rd", postdirectional = "E",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y,
  name_fuzzy_dist = 1L
))
format(match_addr_street(
  addr_street(
    predirectional = "E", premodifier = "", pretype = "US Hwy",
    name = "Mian", posttype = "Rd", postdirectional = "E",
    map_pretype = FALSE, map_posttype = FALSE, map_directional = FALSE, map_ordinal = FALSE
  ),
  toggle_y,
  name_fuzzy_dist = 1L,
  match_street_type = "exact",
  match_street_directional = "exact"
))
A single ZIP code in y is chosen for each ZIP code in x.
By default, if exact matches are not found, common variants of ZIP codes
in x are searched for in y (see ?zipcode_variant).
If multiple variants are present in y, the selected match has the lowest
absolute numeric difference from the ZIP code in x; ties are broken by
OSA string distance and then by choosing the numerically smallest
ZIP code.
match_zipcodes(
  x,
  y,
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap")
)
x, y
|
character vectors of ZIP codes to match |
zip_variants |
logical; fuzzy match to common variants of
|
zip_variant |
character vector; zipcode variant types to use when
|
A character vector, the same length as x, containing the selected
match in y for each ZIP code in x.
match_zipcodes(
  c("45222", "45219", "45219", "45220", "45220", "", NA),
  c("42522", "45200", "45219", "45221", "45223", "45321", "")
)
match_zipcodes(
  c("45222", "45219", "45219", "45220", "45220", "", NA),
  c("42522", "45200", "45219", "45221", "45223", "45321", ""),
  zip_variants = FALSE
)
The U.S. Department of Transportation partners with address programs from state, local, and tribal governments to compile their authoritative data into a database. Find more information here: https://www.transportation.gov/gis/national-address-database
nad_read() reads data from the NAD geodatabase by county,
using source data already downloaded with nad_download() or downloading
it when refresh_source = "yes", and readies it for R.
Counties can be identified either by county name plus state, or by a
5-digit county FIPS identifier. County names and state abbreviations are
resolved internally and still determine the cache path and source query.
The NAD geodatabase has a very large size on disk (~10 GB).
Data binaries are the cached outputs of nad_read() for each
County/State and are created on first run with nad().
Download data binaries to the tools::R_user_dir() data directory, or
point R to these files on disk, to read NAD tables without downloading the
nationwide NAD geodatabase.
(Files are organized by major package version,
NAD version, state, and named by county; e.g., see
list.files(tools::R_user_dir("addr", "data"), recursive = TRUE))
nad(
  county,
  state = NULL,
  version = 22L,
  refresh_binary = c("yes", "no", "force"),
  refresh_source = c("no", "yes", "force")
)
nad_read(
  county,
  state = NULL,
  version = 22L,
  refresh_source = c("no", "yes", "force")
)
nad_download(version = 22L, refresh_source = c("yes", "no", "force"))
county |
character, length one; county name or 5-digit county FIPS identifier |
state |
character, length one; name or abbreviation of state. Required
when |
version |
integer, length one; NAD revision to use. Defaults to |
refresh_binary |
character, length one; choose how to refresh NAD data binaries cached on disk if not already present; "yes" will create data binary if not already present, "no" will error if data binary is not already present, "force" will create the data binary and overwrite any existing data binary |
refresh_source |
character, length one; choose how to refresh NAD source geodatabase on disk if not already present; "yes" will download the geodatabase if not already present, "no" will error if the file does not already exist, "force" will download and overwrite any existing geodatabase |
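The "yes" / "no" / "force" semantics shared by refresh_binary and refresh_source can be sketched with a small base R helper. The function refresh_cached() and the toy make_file() writer are hypothetical and only mirror the behavior described in the argument table; they are not part of addr.

```r
# Hypothetical helper mirroring the documented refresh semantics
refresh_cached <- function(path, refresh = c("yes", "no", "force"), create) {
  refresh <- match.arg(refresh)
  present <- file.exists(path)
  # "no": never create; error if the file is not already there
  if (refresh == "no" && !present) {
    stop("cached file not found and refresh = \"no\": ", path)
  }
  # "yes": create only if absent; "force": always create and overwrite
  if (refresh == "force" || !present) {
    create(path)
  }
  invisible(path)
}

# Toy stand-in for an expensive download or county read
n_created <- 0L
make_file <- function(p) {
  n_created <<- n_created + 1L
  writeLines("toy contents", p)
}

demo_path <- tempfile(fileext = ".txt")
refresh_cached(demo_path, "yes", make_file)   # absent, so it is created
refresh_cached(demo_path, "yes", make_file)   # present, so it is reused
refresh_cached(demo_path, "force", make_file) # recreated regardless
n_created
```

Note that the defaults differ by function in the usage above: nad() and nad_read() default to refresh_source = "no", while nad_download() defaults to "yes".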
NAD source geodatabases are downloaded from the transportation.gov data
portal:
https://data.transportation.gov/d/yw36-suxr
Downloads use the R curl package and resume from any interrupted
partial download left in the addr user data directory.
If the download cannot complete, nad_download() will also work with a
NAD ZIP file that was downloaded another way and placed where
tools::R_user_dir("addr", "data") can find it.
For the original schema, see
https://www.transportation.gov/sites/dot.gov/files/2023-07/NAD_Schema_202304.pdf
Before downloading, please read the disclaimer here:
https://www.transportation.gov/mission/open/gis/national-address-database/national-address-database-nad-disclaimer
Investigate individual address points in the online viewer: https://usdot.maps.arcgis.com/apps/instant/portfolio/index.html?appid=59f7e4fb71994d13b61f424e21a6cffe
The NAD does not distinguish between empty and missing address components.
When reading into R, all missing address components are replaced with an
empty string ("") except for address number (digits), street name,
and ZIP code.
Addresses with malformed ZIP codes are removed.
# explicitly download source data, then cache county output on first read
## Not run:
nad_download(version = 22L)
nad("Butler", "OH")
nad("39017")
## End(Not run)
# example data preloaded for Hamilton County, OH
# works without downloading NAD gdb first
Sys.setenv(R_USER_DATA_DIR = tempfile())
nad("Hamilton", "OH", refresh_source = "no", refresh_binary = "no")
nad("39061", refresh_source = "no", refresh_binary = "no")
An example of the data returned using nad() for
Hamilton County, Ohio (NAD version 22L). See ?nad for more
information about the National Address Database.
nad("Hamilton", "OH", refresh_source = "no", refresh_binary = "no")
and nad("39061", refresh_source = "no", refresh_binary = "no") are
equivalent to nad_example_data().
nad_example_data(match_prepared = FALSE)
match_prepared |
logical; return the example data preprocessed with
|
If match_prepared = FALSE, a tibble with 349,407 rows and 7
columns. If match_prepared = TRUE, an addr_match_index.
nad_example_data()
nad_example_data(match_prepared = TRUE)
Ordinal street names (e.g., "11TH", "5TH") are encoded as zero-padded numeric
identifiers with a special prefix, while non-ordinal street names are encoded
using a Soundex phonetic code (see ?stringdist::phonetic).
Ordinal words (e.g., "Eleventh", "Fifth") are
detected and converted automatically.
Each phonetic key is exactly four characters long.
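The ordinal half of this encoding can be sketched in base R. The helper ordinal_key_sketch() is hypothetical; its "#" prefix and four-digit zero padding are assumptions taken from the "#0007" example in the match_addr_street() documentation. It handles only numeric ordinals such as "7TH", not ordinal words or the Soundex branch.

```r
# Hypothetical helper: encode numeric ordinal street names, NA otherwise
ordinal_key_sketch <- function(x) {
  x <- toupper(x)
  # numeric ordinals: digits followed by ST, ND, RD, or TH
  is_ord <- grepl("^[0-9]+(ST|ND|RD|TH)$", x)
  out <- rep(NA_character_, length(x))
  n <- as.integer(sub("(ST|ND|RD|TH)$", "", x[is_ord]))
  # assumed key format: "#" prefix plus zero-padded number
  out[is_ord] <- sprintf("#%04d", n)
  out
}

ordinal_key_sketch(c("7TH", "5th", "WERK"))
# "#0007" "#0005" NA
```

Non-ordinal names return NA here; in the package they are encoded with a Soundex code instead.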
phonetic_street_key(x)
x |
character vector |
character vector
phonetic_street_key(
  c("MEADOWLARK", "TOWNSEND", "IMMACULATE", "7TH", "WERK", "PAXTON", "5th", "BURNET", "FIFTH", "CLIFTON")
)
Opens a Shiny app that shows how an input address is tagged with
tag_usaddress(), normalized by as_addr(), and then matched in stages
against nad_example_data().
run_addr_explorer(launch.browser = interactive())
launch.browser |
logical; passed to |
Invisibly returns the result of shiny::runApp()
## Not run:
run_addr_explorer()
## End(Not run)
Opens a minimal Shiny app that geocodes one typed address with
geocode() and maps the result with leaflet.
run_geocode_explorer(launch.browser = interactive())
launch.browser |
logical; passed to |
Invisibly returns the result of shiny::runApp()
## Not run:
run_geocode_explorer()
## End(Not run)
taf() uses the arrow package to open the hive-partitioned parquet dataset
of TIGER address features in the addr user data directory.
Arrow FileSystemDataset objects are database-like backends for
larger-than-memory datasets and support dplyr syntax for data manipulation;
see https://arrow.apache.org/docs/r/articles/data_wrangling.html.
Other TAF helpers such as taf_catalog(), taf_install(), and taf_zip()
use nanoparquet directly for flat parquet file reads and writes. Arrow is
only required for the advanced dataset interface returned by taf().
taf(year = as.character(2025:2011), version = "v1")
taf_install(
  county,
  year = as.character(2025:2011),
  version = "v1",
  overwrite = FALSE,
  redownload = FALSE
)
year |
character, length one; vintage of TIGER addrfeat (address feature) files |
version |
character, length one; major version of the package and taf dataset schema |
county |
character, length 1; county FIPS code |
overwrite |
logical, length 1; overwrite an existing county install? |
redownload |
logical, length 1; re-download cached TIGER ZIP files? |
taf_install() downloads and links TIGER address features and
feature names for a specific year and county, installing the resulting
file in the addr user data directory.
About 6% of ADDRFEAT rows do not have a county-local primary FEATNAMES
match by LINEARID. In these cases, street tags are parsed from the
ADDRFEAT full name, and the street_tag_parsed column is set to TRUE.
a Dataset R6 object (see ?arrow::open_dataset); use dplyr
verbs to query the data and get results, see examples
Sys.setenv("R_USER_DATA_DIR" = tempfile())
taf_install("39061", "2025")
taf()
# use dplyr verbs to query
library(dplyr, warn.conflicts = FALSE)
# find top ten most frequent street name-posttype combinations
taf() |>
  group_by(street_name, street_posttype) |>
  summarize(
    n_zips = n_distinct(ZIP),
    n_ranges = n(),
    .groups = "drop"
  ) |>
  arrange(desc(n_zips), desc(n_ranges)) |>
  collect() |>
  slice(1:10)
Sys.setenv("R_USER_DATA_DIR" = tempfile())
taf_install("39061", "2025")
taf_catalog() reads a TIGER-derived catalog of ZIP codes present in each
county's TIGER address feature file for a specific year and addr TAF schema
version. The catalog is installed with the package and is used to plan which
county TAF files may be needed for a set of ZIP codes. It is separate from
the local install manifest, which records only files installed on the current
machine.
taf_catalog(year = as.character(2025:2011), version = "v1")
year |
character, length one; vintage of TIGER addrfeat (address feature) files |
version |
character, length one; major version of the package and taf dataset schema |
a tibble with county_fips, ZIP, zip3, zip2, and n_ranges
columns
taf_catalog("2025")
taf_needed_counties() uses taf_catalog() to identify county TAF files
that may contain address ranges for ZIP codes in x, including selected ZIP
code variants when requested. taf_ensure() installs any of those counties
that are not already present in the local TAF manifest.
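The planning step can be pictured as a lookup followed by a set difference. The sketch below uses base R with a hand-written toy catalog and a toy vector of installed counties; the rows are illustrative and not taken from the real taf_catalog() output or the local manifest.

```r
# Toy stand-in for taf_catalog(): which ZIP codes appear in which county file
catalog <- data.frame(
  county_fips = c("39061", "39061", "39017", "39025"),
  ZIP = c("45220", "45219", "45011", "45103"),
  stringsAsFactors = FALSE
)

# Toy stand-in for the local install manifest
installed <- c("39061")

# ZIP codes found in the input addresses
zips <- c("45220", "45011")

# Counties whose files may contain ranges for those ZIP codes
needed <- unique(catalog$county_fips[catalog$ZIP %in% zips])

# Counties that would still have to be installed
missing <- setdiff(needed, installed)
needed
missing
```

A real run would also expand each input ZIP code into its requested variants before the lookup, which this sketch leaves out.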
taf_needed_counties(
  x,
  year = as.character(2025:2011),
  version = "v1",
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap")
)
taf_ensure(
  x,
  year = as.character(2025:2011),
  version = "v1",
  zip_variants = TRUE,
  zip_variant = c("minus1", "plus1", "sub5", "sub4", "swap"),
  redownload = FALSE
)
x |
an addr vector ( |
year |
character, length one; vintage of TIGER addrfeat (address feature) files |
version |
character, length one; major version of the package and taf dataset schema |
zip_variants |
logical; fuzzy match to common variants of
|
zip_variant |
character vector; zipcode variant types to use when
|
redownload |
logical, length 1; re-download cached TIGER ZIP files? |
taf_needed_counties() returns a tibble with catalog columns plus
source_zip and source_zip_variant. taf_ensure() invisibly returns the
subset of needed counties that were missing before installation.
taf_needed_counties(as_addr("10 MAIN ST CINCINNATI OH 45220"))
taf_zip() reads and transforms taf() data for a subset of ZIP codes.
It reconstructs the county_fips, s2_geography, and addr_street
vectors in the returned data frame.
taf_zip(x, map = TRUE, year = as.character(2025:2011), version = "v1")
x |
character vector of five-digit ZIP codes |
map |
logical, length 1; map street tags read from taf() data
(type, directional, ordinal) when converting to |
year |
character, length 1; vintage of TIGER addrfeat (address feature) files |
version |
character, length 1; major version of the package and taf dataset schema |
a tibble with LINEARID, FULLNAME, side, ZIP,
FROMHN, TOHN, PARITY, OFFSET, s2_geography, addr_street,
county_fips, and street_tag_parsed columns
Sys.setenv("R_USER_DATA_DIR" = tempfile())
taf_install("39061", "2025")
taf_zip(c("45249", "45230", "45220"))
Addresses are tagged using the usaddress conditional random field model, as implemented in a Rust port of usaddress. Possible address labels include:
AddressNumberPrefix
AddressNumberSuffix
AddressNumber
BuildingName
CornerOf
IntersectionSeparator
LandmarkName
NotAddress
OccupancyIdentifier
OccupancyType
PlaceName
Recipient
StateName
StreetNamePostDirectional
StreetNamePostType
StreetNamePreDirectional
StreetNamePreModifier
StreetNamePreType
StreetName
SubaddressIdentifier
SubaddressType
USPSBoxGroupID
USPSBoxGroupType
USPSBoxID
USPSBoxType
ZipCode
Find more information about the definitions at https://www.fgdc.gov/standards/projects/address-data
tag_usaddress(x = NA_character_, clean = TRUE)
x |
character vector of addresses |
clean |
logical; clean address text with clean_address_text() before tagging? |
a list of vectors of named address tags
tag_usaddress(
  c(
    "290 Ludlow Avenue Apt 2 Cincinnati OH 45220",
    "3333 Burnet Ave Cincinnati Ohio 45219",
    "120 North Main Street, Greenville, SC 29601",
    "200 Southwest North Street, Topeka, KS 66603",
    "215 Highway 88 Road, Jackson, CA 95642"
  )
)
# edge cases!
tag_usaddress(
  c(
    "1600 Pennsylvania Avenue NW, Washington, DC 20500", # post-directional quadrant
    "1 Infinite Loop, Cupertino, CA 95014", # corporate campus street name
    "210 East 400 South, Salt Lake City, UT 84111", # grid addressing (Utah)
    "N6W23001 Bluemound Road, Wauwatosa, WI 53226", # address number prefix grid (Wisconsin)
    "350 Fifth Avenue, New York, NY 10118", # ordinal street name
    "4059 Mt Lee Drive, Hollywood, CA 90068", # abbreviated street element
    "233 South Wacker Drive, Chicago, IL 60606", # pre-directional
    "700 Exposition Park Drive, Los Angeles, CA 90037", # multi-word street name
    "2 South Biscayne Boulevard, Miami, FL 33131" # directional + boulevard
  )
)
TIGER address features (street address ranges) are read from compressed addrfeat (address feature) shapefiles for each county and Census vintage. If not already present, compressed addrfeat shapefiles are downloaded from the Census FTP site to the addr user data directory.
When reading into R, the data is converted to one row per street side
(L/R) for use by taf_install().
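The conversion to one row per street side can be sketched in base R. The wide column names below (LFROMHN, LTOHN, RFROMHN, RTOHN, ZIPL, ZIPR) are assumed to follow the raw TIGER ADDRFEAT layout, and the two toy segments are invented for illustration; side_rows() is a hypothetical helper, not the package's code.

```r
# Toy wide-format segments: one row per street segment, left and right sides together
wide <- data.frame(
  LINEARID = c("A1", "A2"),
  FULLNAME = c("Main St", "Oak Rd"),
  LFROMHN = c("1", "101"),
  LTOHN = c("99", "199"),
  RFROMHN = c("2", "100"),
  RTOHN = c("98", "198"),
  ZIPL = c("45220", "45219"),
  ZIPR = c("45220", "45220"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pull the columns for one side ("L" or "R")
side_rows <- function(d, side) {
  data.frame(
    LINEARID = d$LINEARID,
    FULLNAME = d$FULLNAME,
    side = side,
    ZIP = d[[paste0("ZIP", side)]],
    FROMHN = d[[paste0(side, "FROMHN")]],
    TOHN = d[[paste0(side, "TOHN")]],
    stringsAsFactors = FALSE
  )
}

# Stack left and right sides: two rows per original segment
long <- rbind(side_rows(wide, "L"), side_rows(wide, "R"))
long
```

Keeping sides as separate rows lets each side carry its own ZIP code and house-number range, which matters when the two sides of a street fall in different ZIP codes.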
tiger_addr_feat(county, year, redownload = FALSE)
county |
character string of county FIPS identifier |
year |
character year of the Census TIGER/Line product |
redownload |
logical, length 1; re-download the cached TIGER ZIP file? |
a tibble with LINEARID, FULLNAME, side, ZIP,
FROMHN, TOHN, PARITY, OFFSET, and s2_geography columns
tiger_addr_feat("39061", "2025")
TIGER primary feature names are read from compressed feature-name databases for each county and Census vintage. If not already present, the compressed feature-name databases are downloaded from the Census FTP site to the addr user data directory.
When reading into R, the data is filtered to addressable MTFCCs (S1100, S1200, S1400, S1640) that have a name.
tiger_feat_names(county, year, redownload = FALSE)
county |
character string of county FIPS identifier |
year |
character year of the Census TIGER/Line product |
redownload |
logical, length 1; re-download the cached TIGER ZIP file? |
a tibble with unique LINEARID and addr columns
tiger_feat_names("39061", "2025")
voter_addresses() returns an example character vector of real-world
addresses downloaded from the Hamilton County, Ohio voter registration
database on 2024-09-12. AddressPreDirectional, AddressNumber,
AddressStreet, AddressSuffix, CityName, "OH", and AddressZip
were pasted together to create 242,133 unique registered-voter addresses.
voter_addresses()
a character vector
voter_addresses() |> head()
An input ZIP code is used to generate variants (e.g., for 45220):
minus1: subtracting one from the ZIP code (45219)
plus1: adding one to the ZIP code (45221)
sub5: substituting the fifth digit of the ZIP code (45221, 45222, 45223, 45224, 45225, 45226, 45227, 45228, 45229)
sub4: substituting the fourth digit of the ZIP code (45200, 45210, 45230, 45240, 45250, 45260, 45270, 45280, 45290)
swap: swapping the second and third digits of the ZIP code (42520)
More than one variant type can be created at once and variants will be returned in the same order as they were requested (see examples).
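The variant definitions above can be reproduced with a few lines of base R. The helper zip_variants_sketch() is hypothetical and written only from the description here; it does not guard edge cases such as "00000" or "99999", and it may differ from zipcode_variant() in details such as duplicate handling.

```r
# Hypothetical helper: build ZIP code variants from the definitions above
zip_variants_sketch <- function(zip, variant = c("minus1", "plus1", "sub5", "sub4", "swap")) {
  stopifnot(length(zip) == 1L, grepl("^[0-9]{5}$", zip))
  # several.ok keeps every requested type, in the order requested
  variant <- match.arg(variant, several.ok = TRUE)
  digits <- strsplit(zip, "")[[1]]
  make <- list(
    minus1 = function() sprintf("%05d", as.integer(zip) - 1L),
    plus1 = function() sprintf("%05d", as.integer(zip) + 1L),
    # replace the fifth digit with every other digit
    sub5 = function() paste0(substr(zip, 1, 4), setdiff(as.character(0:9), digits[5])),
    # replace the fourth digit with every other digit
    sub4 = function() paste0(substr(zip, 1, 3), setdiff(as.character(0:9), digits[4]), digits[5]),
    # swap the second and third digits
    swap = function() paste0(digits[c(1, 3, 2, 4, 5)], collapse = "")
  )
  unlist(lapply(variant, function(v) make[[v]]()), use.names = FALSE)
}

zip_variants_sketch("45220", "swap")                # "42520"
zip_variants_sketch("45220", c("plus1", "minus1"))  # "45221" "45219"
zip_variants_sketch("45220", "sub4")
```

Building each variant type from its own small function makes the requested order easy to preserve: the result is simply the concatenation of each type's output in the order given.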
zipcode_variant(x, variant = c("minus1", "plus1", "sub5", "sub4", "swap"))
x |
character, length one; five-digit ZIP code |
variant |
character; one or more variants to create; see description |
character vector of five digit ZIP code variants
zipcode_variant("45220")
# order matters!
zipcode_variant("45220", c("minus1", "plus1"))
zipcode_variant("45220", c("plus1", "minus1"))
zipcode_variant("45220", "sub5")
zipcode_variant("45220", "swap")