Here’s a fun common task. I have a data set that has a bunch of codes like:
Name | Abbr | Code |
---|---|---|
Alabama | AL | 01 |
Alaska | AK | 02 |
Arizona | AZ | 04 |
Arkansas | AR | 05 |
California | CA | 06 |
Colorado | CO | 08 |
Connecticut | CT | 09 |
Delaware | DE | 10 |
All of your data is labeled with the code
value. In this case, you want to do a join
so that you can use the actual names because it’s 2017 and we’re not animals.
But what if your data, like the accounting data we deal with at Allovue, has lots of code fields. You probably either have one table that contains all of the look ups in “long” format, where there is a column that represents which column in your data the code is for like this:
code | type | name |
---|---|---|
01 | fips | Alabama |
02 | fips | Alaska |
Alternatively, you may have a lookup table per data element (so one called fips, one called account, one called function, etc).
I bet most folks do the following in this scenario:
account <- left_join(account, account_lookup)
account <- left_join(account, fips)
## Maybe this instead ##
account %<>%
left_join(account_lookup) %>%
left_join(fips)
I want to encourage you to do this a little different using purrr
. Here’s some annotated code that uses reduce_right
to make magic.
# Load a directory of .csv files that has each of the lookup tables
lookups <- map(dir('data/lookups'), read.csv, stringsAsFactors = FALSE)
# Alternatively if you have a single lookup table with code_type as your
# data attribute you're looking up
# lookups <- split(lookups, code_type)
lookups$real_data <- read.csv('data/real_data.csv',
stringsAsFactors = FALSE)
real_data <- reduce_right(lookups, left_join)
Boom, now you went from data with attributes like funds_code
, function_code
, state_code
to data that also has funds_name
, function_name
, state_name
1. What’s great is that this same code can be reused no matter how many fields require a hookup. I’m oftent dealing with accounting data where the accounts are defined by a different number of data fields, but my code doesn’t care at all.
-
My recommendation is to use consistent naming conventions like
_code
and_name
so that knowing how to do the lookups is really straightforward. This is not unlike the convention with Microsoft SQL where the primary key of a table is namedId
and a foreign key to that table is namedTableNameId
. Anything that helps you figure out how to put things together without thinking is worth it. ↩︎