Maize and cacao. @westminsterfox is like one of three POC sitting here. #stilldelicious
Now it’s spring.
I have been fascinated with assertive programming in R since this post from 2015 1. Tony Fischetti wrote a great blog post to announce `assertr` 2.0's release on CRAN that really clarified the package's design.
UseRs often do crazy things that no sane developer in another language would do. Today I decided to build a way to check foreign key constraints in R to help me learn the `assertr` package.
What do you mean, foreign key constraints?
Well, in many ways this is an extension of my last post on using `purrr::reduce`. I have a set of data with codes (like FIPS codes, or user ids, etc.) and I want to make sure that all of those codes are "real" codes (as in, I have a definition for that value). So I may have a FIPS code `data.frame` with `fips_code` and `name` as the columns, or a user `data.frame` with columns `id`, `fname`, `lname`, and `email`.
In a database, I might have a foreign key constraint on my table that just has codes, so that I could not create a row that uses an `id` or `code` value (or whatever) that did not exist in my lookup table. Of course, in R our data is disconnected and non-relational. New users may exist in my dataset that weren't there the last time I downloaded the `users` table, for example.
Ok, so these are just collections of enumerated values
Yup! That's right! In some ways like R's beloved `factors`, I want to have problems when my data contains values that don't have a corresponding row in another `data.frame`, just like trying to insert a value into a `factor` that isn't an existing level.
`assertr` anticipates just this with the `in_set` helper. This way I can `assert` that my data is in a defined set of values or get an error.
my_df <- data.frame(x = c(0,1,1,2))
assert(my_df, in_set(0,1), x)
# Column 'x' violates assertion 'in_set(0, 1)' 1 time
# index value
# 1 4 2
# Error: assertr stopped execution
Please Don’t stop()
By default, `assert` raises an error with an incredibly helpful message. It tells you which column the assertion was on, what the assertion was, how many times that assertion failed, and then returns the column index and value of the failed cases.
Even better, `assert` has an argument `error_fun` which, combined with some built-in functions, can allow for all kinds of fun behavior when an assertion fails. What if, for example, I actually want to collect that error message for later and not have a hard stop if an assertion failed?
By using `error_append`, `assert` will return the original `data.frame` when there's a failure, with a special attribute called `assertr_errors` that can be accessed later with all the information about failed assertions.
my_df %<>%
assert(in_set(0,1), x, error_fun = error_append) %>%
verify(x == 1, error_fun = error_append)
my_df
# x
# 1 0
# 2 1
# 3 1
# 4 2
attr(my_df, 'assertr_errors')
# [[1]]
# Column 'x' violates assertion 'in_set(0, 1)' 1 time
# index value
# 1 4 2
#
# [[2]]
# verification [x == 1] failed! (2 failures)
(Ok, I cheated there, folks. I used `verify`, a new function from `assertr`, and a bunch of `magrittr` pipes like `%<>%`.)
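If you haven't seen `%<>%` before, it's magrittr's assignment pipe: it pipes the left-hand side through the chain and assigns the result back. The chunk above is shorthand for:

```r
# Equivalent to the %<>% version above: pipe my_df through and reassign it
my_df <- my_df %>%
  assert(in_set(0,1), x, error_fun = error_append) %>%
  verify(x == 1, error_fun = error_append)
```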
Enough with the toy examples
Ok, so here's the code I wrote today. This started as a huge mess that I ended up turning into two functions. First, `is_valid_fk` provides a straightforward way to get `TRUE` or `FALSE` on whether or not all of your codes/ids exist in a lookup `data.frame`.
is_valid_fk <- function(data, key, values,
error_fun = error_logical,
success_fun = success_logical){
assert_(data, in_set(values), key,
error_fun = error_fun, success_fun = success_fun)
}
The first argument `data` is your `data.frame`, the second argument `key` is the foreign key column in `data`, and `values` are all valid values for `key`. Defaulting `error_fun` and `success_fun` to `*_logical` means a single boolean is the expected response.
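Here's a quick look at how that behaves with a tiny made-up FIPS lookup (the `fips` and `df` objects below are invented for illustration; `assertr` is loaded and `is_valid_fk` is defined as above):

```r
# A made-up lookup table and data set to exercise is_valid_fk()
fips <- data.frame(fips_code = c('01', '02', '04'),
                   name = c('Alabama', 'Alaska', 'Arizona'),
                   stringsAsFactors = FALSE)
df <- data.frame(fips_code = c('01', '02', '99'),
                 stringsAsFactors = FALSE)

is_valid_fk(df, key = 'fips_code', values = fips$fips_code)
# should return FALSE, because '99' has no row in the lookup
```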
But I don’t really want to do these one column at a time. I want to check if all of the foreign keys in a table are good to go. I also don’t want a boolean, I want to get back all the errors in a useable format. So I wrote all_valid_fk
.
Let’s take it one bit at a time.
all_valid_fk <- function(data, fk_list, id = 'code') {
- `data` is the `data.frame` we're checking foreign keys in.
- `fk_list` is a list of `data.frame`s. Each element is named for the key that it looks up; each `data.frame` contains the valid values for that key, named…
- `id`, the name of the column in each `data.frame` in the list `fk_list` that corresponds to the valid keys.
verify(data, do.call(has_all_names, as.list(names(fk_list))))
Right away, I want to know if my data has all the columns my `fk_list` says it should. I have to do some `do.call` magic because `has_all_names` wants something like `has_all_names('this', 'that', 'the_other')`, not `has_all_names(c('this', 'that', 'the_other'))`.
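If the splicing is hard to picture, here's a throwaway illustration (the `tmp` data.frame and `keys` vector are made up for this example):

```r
# A throwaway data.frame just to show the do.call() splicing inside verify()
tmp <- data.frame(this = 1, that = 2, the_other = 3)
keys <- c('this', 'that')

verify(tmp, do.call(has_all_names, as.list(keys)))               # passes, returns tmp
# verify(tmp, do.call(has_all_names, as.list(c(keys, 'nope'))))  # would stop with an error
```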
The next part is where the magic happens.
accumulated_errors <- map(names(fk_list),
~ is_valid_fk(data,
key = .x,
values = fk_list[[.x]][[id]],
error_fun = error_append,
success_fun = success_continue)) %>%
map(attr, 'assertr_errors') %>%
reduce(append)
Using `map`, I am able to call `is_valid_fk` on each of the columns in `data` that have a corresponding lookup table in `fk_list`. The valid values are `fk_list[[.x]][[id]]`, where `.x` is the name of the `data.frame` in `fk_list` (which corresponds to the name of the code we're looking up in `data` and exists for sure, thanks to that `verify` call) and `id` is the name of the key in that `data.frame`, as stated earlier. I've replaced `error_fun` and `success_fun` so that the code does not exit `map` as soon as there are any problems. Instead, the data is returned for each assertion, with the error attribute if one exists. 2 Immediately, `map` is called on the resulting list of `data.frame`s to collect the `assertr_errors`, which are `reduce`d using `append` into a flattened list.
If there are no errors accumulated, `accumulated_errors` is `NULL`, and the function exits early.
if(is.null(accumulated_errors)) return(list())
I could have stopped here and returned all the messages in `accumulated_errors`. But I don't like all that text; I want something neater to work with later. The structure I decided on was a list of `data.frame`s, with each element named for the column with the failed foreign key assertion and the contents being the index and value that failed the constraint.
By calling `str` on the `data.frame`s returned by the assertions, I was able to see that the `index` and `value` tables printed in the failed `assert` messages are contained in `error_df`. So next I extract each of those `data.frame`s into a single list.
reporter <- accumulated_errors %>%
map('error_df') %>%
map(~ map_df(.x, as.character)) # because factors suck
I’m almost done. I have no way of identifying which column created each of those error_df
in reporter
. So to name each element based on the column that failed the foreign key contraint, I have to extract data from the message
attribute. Here’s what I came up with.
names(reporter) <- accumulated_errors %>%
map_chr('message') %>%
gsub("^Column \'([a-zA-Z]+)\' .*$", '\\1', x = .)
reporter
So let's create some fake data and run `all_valid_fk` to see the results:
> df <- data.frame(functions = c('1001','1002', '3001', '3002'),
objects = c('100','102', '103', '139'),
actuals = c(10000, 2431, 809, 50000),
stringsAsFactors = FALSE)
> chart <- list(functions = data.frame(code = c('1001', '1002', '3001'),
name = c('Foo', 'Bar', 'Baz'),
stringsAsFactors = FALSE),
objects = data.frame(code = c('100', '102', '103'),
name = c('Mom', 'Dad', 'Baby'),
stringsAsFactors = FALSE))
> all_valid_fk(data = df, fk_list = chart, id = 'code')
$functions
# A tibble: 1 × 2
index value
<chr> <chr>
1 4 3002
$objects
# A tibble: 1 × 2
index value
<chr> <chr>
1 4 139
Beautiful!
And here's `all_valid_fk` in one big chunk.
all_valid_fk <- function(data, fk_list, id = 'code') {
verify(data, do.call(has_all_names, as.list(names(fk_list))))
accumulated_errors <- map(names(fk_list),
~ is_valid_fk(data,
key = .x,
values = fk_list[[.x]][[id]],
error_fun = error_append,
success_fun = success_continue)) %>%
map(attr, 'assertr_errors') %>%
reduce(append)
if(is.null(accumulated_errors)) return(list())
reporter <- accumulated_errors %>%
map('error_df') %>%
map(~ map_df(.x, as.character))
names(reporter) <- accumulated_errors %>%
map_chr('message') %>%
gsub('Column \'(\\S*?)\'.*$', '\\1', x = .)
reporter
}
My thanks to Jonathan Carroll who was kind enough to read this post closely and actually tried to run the code. As a result, I’ve fixed a couple of typos and now have an improved regex pattern above.
1. I appear to have forgotten to build link post types into my Hugo blog, so the missing link from that post is here. ↩︎
2. I am a little concerned about memory here. Eight assertions would mean, at least briefly, eight copies of the same `data.frame` being carried around when I don't actually need that data. There is probably a better way. ↩︎
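One possibility (untested, just a sketch): pull the `assertr_errors` attribute off inside the same `map()` call, as a drop-in replacement for the `accumulated_errors` step inside `all_valid_fk`, so each copy can be discarded as soon as its errors are extracted.

```r
# Sketch: grab the attribute immediately so the data.frame copies don't linger
accumulated_errors <- map(names(fk_list),
                          ~ attr(is_valid_fk(data,
                                             key = .x,
                                             values = fk_list[[.x]][[id]],
                                             error_fun = error_append,
                                             success_fun = success_continue),
                                 'assertr_errors')) %>%
  reduce(append)
```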
One of my favorite things to eat in Baltimore.
Feels like coming home.
“I’ve been waiting for you.”
Here’s a fun common task. I have a data set that has a bunch of codes like:
Name | Abbr | Code |
---|---|---|
Alabama | AL | 01 |
Alaska | AK | 02 |
Arizona | AZ | 04 |
Arkansas | AR | 05 |
California | CA | 06 |
Colorado | CO | 08 |
Connecticut | CT | 09 |
Delaware | DE | 10 |
All of your data is labeled with the `code` value. In this case, you want to do a join so that you can use the actual names, because it's 2017 and we're not animals.
But what if your data, like the accounting data we deal with at Allovue, has lots of code fields? You probably either have one table that contains all of the lookups in "long" format, where there is a column that represents which column in your data the code is for, like this:
code | type | name |
---|---|---|
01 | fips | Alabama |
02 | fips | Alaska |
Alternatively, you may have a lookup table per data element (so one called fips, one called account, one called function, etc).
I bet most folks do the following in this scenario:
account <- left_join(account, account_lookup)
account <- left_join(account, fips)
## Maybe this instead ##
account %<>%
left_join(account_lookup) %>%
left_join(fips)
I want to encourage you to do this a little differently using `purrr`. Here's some annotated code that uses `reduce_right` to make magic.
# Load a directory of .csv files that has each of the lookup tables
lookups <- map(dir('data/lookups', full.names = TRUE), read.csv,
               stringsAsFactors = FALSE)
# Alternatively, if you have a single lookup table with code_type as your
# data attribute you're looking up:
# lookups <- split(lookups, lookups$code_type)
lookups$real_data <- read.csv('data/real_data.csv',
                              stringsAsFactors = FALSE)
real_data <- reduce_right(lookups, left_join)
Boom: now you went from data with attributes like `funds_code`, `function_code`, and `state_code` to data that also has `funds_name`, `function_name`, and `state_name` 1. What's great is that this same code can be reused no matter how many fields require a lookup. I'm often dealing with accounting data where the accounts are defined by a different number of data fields, but my code doesn't care at all.
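If you want to see the pattern end to end without any files on disk, here's a self-contained toy version (the `funds` and `states` lookups below are invented for illustration):

```r
library(purrr)
library(dplyr)

# Two made-up lookup tables plus some made-up "real" data
lookups <- list(
  funds = data.frame(funds_code = c('10', '20'),
                     funds_name = c('General Fund', 'Grants'),
                     stringsAsFactors = FALSE),
  states = data.frame(state_code = c('01', '02'),
                      state_name = c('Alabama', 'Alaska'),
                      stringsAsFactors = FALSE)
)
lookups$real_data <- data.frame(funds_code = c('10', '20', '10'),
                                state_code = c('01', '02', '02'),
                                actual = c(100, 250, 70),
                                stringsAsFactors = FALSE)

# Same one-liner as above: the joins fold right-to-left onto real_data
real_data <- reduce_right(lookups, left_join)
# real_data now has funds_name and state_name alongside the codes
```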
1. My recommendation is to use consistent naming conventions like `_code` and `_name` so that knowing how to do the lookups is really straightforward. This is not unlike the convention with Microsoft SQL where the primary key of a table is named `Id` and a foreign key to that table is named `TableNameId`. Anything that helps you figure out how to put things together without thinking is worth it. ↩︎
One of my goals for 2017 is to contribute more to the R open source community. At the beginning of last year, I spent a little time helping to refactor rio. It was one of the more rewarding things I did in all of 2016. It wasn’t a ton of work, and I feel like I gained a lot of confidence in writing R packages and using S3 methods. I wrote code that R users download and use thousands of times a month.
I have been on the lookout for a Javascript-powered interactive charting library since `ggvis` was announced in 2014. But `ggvis` seems to have stalled out in favor of other projects (for now), and the evolution of `rCharts` into `htmlwidgets` left me feeling like there were far too many options and no clear choices.
What I was looking for was a plotting library to make clean, attractive graphics with tooltips, one that came with clear documentation and required virtually no knowledge of Javascript. Frankly, all of the `htmlwidgets` stuff was very intimidating. From my vantage point, skimming blog posts and watching stuff come by on Twitter, `htmlwidgets`-based projects all felt very much directed at Javascript polyglots.
Vega and Vega-Lite had a lot of the qualities I sought in a plotting library. Reading and writing JSON is very accessible compared to learning Javascript, especially with R's excellent translation from lists to JSON. And although I know almost no Javascript, I found that both Vega and Vega-Lite had easy-to-understand documents that felt a lot like building grammar of graphics 1 plots.
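That list-to-JSON translation is worth seeing once. Here's a minimal sketch with `jsonlite` and a hand-written fragment of a simplified, Vega-Lite-style spec (the field names are made up):

```r
library(jsonlite)

# An R list that mirrors the shape of a (simplified) Vega-Lite spec
spec <- list(
  mark = "bar",
  encoding = list(
    x = list(field = "schools",   type = "nominal"),
    y = list(field = "per_pupil", type = "quantitative")
  )
)

toJSON(spec, auto_unbox = TRUE, pretty = TRUE)
```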
So I decided to take the plunge: there was a `vegalite` package and the examples didn't look so bad. It was time to use my first `htmlwidgets` package.
Things went great. I had some simple data and I wanted to make a bar chart. I wrote:
vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar()
A bar chart was made! But then I wanted to use the font Lato, which is what we use at Allovue. No worries, Vega-Lite has a property called `titleFont` for axes. So I went to do:
vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar() %>%
axis_x(titleFont = 'Lato')
Bummer. It didn't work. I almost stopped there, experiment over. But then I remembered my goal and I thought, maybe I need to learn to contribute to a package that is an `htmlwidget` and not simply use an `htmlwidget`-based package. I should at least look at the code.
What I found surprised me. Under the hood, all the R package does is build up lists. It makes so much sense: pass JSON to Javascript to process and do what's needed.
So it turned out, `vegalite` for R was a bit behind the current version of Vega-Lite and didn't have the `titleFont` property yet. And with that, I made my first commit. All I had to do was update the function definition and add the new arguments to the axis data like so:
if (!is.null(titleFont)) vl$x$encoding[[chnl]]$axis$titleFont <- titleFont
But why stop there? I wanted to update all of `vegalite` to use the newest available arguments. Doing so looked like a huge pain, though. The original package author made these great functions like `axis_x` and `axis_y`. They both had the same arguments; the only difference was that the "channel" was preset as `x` or `y` based on which function was called. The problem was that all of the arguments, all of the assignments, and all of the documentation had to be copied twice. It was worse with `encode` and `scale`, which had many, many functions that are similar or identical in their "signature". No wonder the package was missing so many Vega-Lite features: they were a total pain to add.
So as a final step, I decided I would do a light refactor across the whole package. For each of the core functions, like `encode` and `axis`, I would write a single generic function like `encode_vl()` that would hold all of the possible arguments for the encoding portion of Vega-Lite. Then the specific functions like `encode_x` could become wrapper functions that internally call `encode_vl`, like so:
encode_x <- function(vl, ...) {
vl <- encode_vl(vl, chnl = "x", ...)
vl
}
encode_y <- function(vl, ...) {
vl <- encode_vl(vl, chnl = "y", ...)
vl
}
encode_color <- function(vl, ...) {
vl <- encode_vl(vl, chnl = "color", ...)
vl
}
Now, in order to update the documentation and the arguments for `encoding`, I just have to update the `encode_vl` function. It's a really nice demonstration, in my opinion, of the power of R's `...` syntax. All of the wrapper functions can just pass whatever additional arguments the caller wants to `encode_vl` without having to explicitly list them each time.
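To make that pattern concrete, here is a hypothetical, stripped-down version: a single workhorse function holding the arguments, with thin wrappers passing `...` through. The `set_axis_vl` and `axis_*_demo` names below are invented; they are not the package's actual internals.

```r
# A made-up workhorse in the same style (not vegalite's real encode_vl):
# every argument lives here once, and the wrappers only fix the channel.
set_axis_vl <- function(vl, chnl = "x", title = NULL, titleFont = NULL) {
  if (!is.null(title))     vl$x$encoding[[chnl]]$axis$title     <- title
  if (!is.null(titleFont)) vl$x$encoding[[chnl]]$axis$titleFont <- titleFont
  vl
}

axis_x_demo <- function(vl, ...) set_axis_vl(vl, chnl = "x", ...)
axis_y_demo <- function(vl, ...) set_axis_vl(vl, chnl = "y", ...)
```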
This greatly reduced duplication in the code and made it far easier to update `vegalite` to the newest version of Vega-Lite, which I also decided to do.
Now Vega-Lite itself is embarking on a 2.0 release that I have a feeling will have some pretty big changes in store. I'm not sure if I'll be the one to update `vegalite` (in the end, I think that Vega-Lite is too simple for the visualizations I need to do), but I am certain whoever does the update will have a much easier go of it now than they would have just over a month ago.
Thanks to Bob Rudis for making `vegalite` and giving me free range after a couple of commits to go hog-wild on his package!
1. The `gg` in `ggplot2`. ↩︎
I was not!
Just a night on the town in Portland.
You can check my Goodreads profile. I love science fiction and fantasy. And I know it's 2017 and everyone has already observed the dominance of "geek culture", with Disney properties from Marvel and now Star Wars everywhere. Hell, Suicide Squad won a goddamn Oscar.
But I never felt like SFF was all that mainstream. SyFy might have made (and renewed) a TV series based on The Magicians, but I still feel like the disaffected entitled shit that held onto his love of genre fiction too long when I crawl into bed and hide in speculative fiction (thank you Quentin, for so completely capturing what a shit I was at 14).
Yesterday, I was confronted with the reality of SFF going mainstream at Powell’s City of Books. I was fully unprepared to see the contents of their Best Selling Fiction shelf.
By my count, at least 16 of the top 42 are SFF. The Name of the Wind, The Left Hand of Darkness, The Fifth Season, 2312, and Uprooted are some of the best books I've read in the last four or five years. To think of these books as best sellers when they don't have a TV show coming out (like American Gods, The Handmaid's Tale, The Man in the High Castle, and The Magicians) and aren't assigned in high school classrooms (1984, Slaughterhouse-Five) is just shocking. In my mind, these aren't best sellers; they're tiny nods between myself and other quiet bookshoppers that we are kin.
I am not sad though. I am thrilled. I want to live in a world where I can just assume acquaintances are reading The Fifth Season and Uprooted.
This is a part of the march of progress that seemed certain to triumph a few months ago.
Chilaquiles, always, everywhere.
Has a bookstore ever known its sci-fi/fantasy audience this well?
Grilling in the dark because our courtyard lights are broken.
70 degrees? We’re grilling.
Going to grandma and grandpa’s house for rehab.
Star Wars was on when we got to grandma and grandpa’s!
Stoned post-surgery