April 9, 2017

maiz y cacao. @westminsterfox is like one of three POC sitting here. #stilldelicious

April 5, 2017
April 3, 2017
April 2, 2017

I have been fascinated with assertive programming in R since this post from 2015 1. Tony Fischetti wrote a great blog post announcing assertr 2.0’s release on CRAN, and it really clarified the package’s design.

UseRs often do crazy things that no sane developer in another language would do. Today I decided to build a way to check foreign key constraints in R to help me learn the assertr package.

What do you mean, foreign key constraints?

Well, in many ways this is an extension of my last post on using purrr::reduce. I have a set of data with codes (like FIPS codes, or user ids, etc.) and I want to make sure that all of those codes are “real” codes (as in I have a definition for that value). So I may have a FIPS code data.frame with fips_code and name as the columns, or a user data.frame with columns id, fname, lname, and email.
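
To make that concrete, here’s a quick sketch of the kind of lookup data.frames I mean (the values are made up):

fips <- data.frame(fips_code = c('01', '02', '04'),
                   name = c('Alabama', 'Alaska', 'Arizona'),
                   stringsAsFactors = FALSE)

users <- data.frame(id = c(1, 2),
                    fname = c('Ada', 'Grace'),
                    lname = c('Lovelace', 'Hopper'),
                    email = c('ada@example.com', 'grace@example.com'),
                    stringsAsFactors = FALSE)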

In a database, I might have a foreign key constraint on my table that just has codes so that I could not create a row that uses an id or code value or whatever that did not exist in my lookup table. Of course in R, our data is disconnected and non-relational. New users may exist in my dataset that weren’t there the last time I downloaded the users table, for example.

Ok, so these are just collections of enumerated values

Yup! That’s right! Much as with R’s beloved factors, I want errors when my data contains values that don’t have a corresponding row in another data.frame, just like trying to insert a value into a factor that isn’t an existing level.
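
For comparison, here’s the factor behavior I have in mind:

f <- factor(c('a', 'b'))
f[3] <- 'c'
# Warning: invalid factor level, NA generated
f
# [1] a    b    <NA>
# Levels: a b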

assertr anticipates just this, with the in_set helper. This way I can assert that my data is in a defined set of values or get an error.

library(assertr)

my_df <- data.frame(x = c(0,1,1,2))
assert(my_df, in_set(0,1), x)
# Column 'x' violates assertion 'in_set(0, 1)' 1 time
#   index value
# 1     4     2
# Error: assertr stopped execution

Please Don’t stop()

By default, assert raises an error with an incredibly helpful message. It tells you which column the assertion was on, what the assertion was, how many times that assertion failed, and then shows the row index and value of each failed case.

Even better, assert has an error_fun argument which, combined with some built-in functions, allows for all kinds of fun behavior when an assertion fails. What if, for example, I actually want to collect that error message for later and not have a hard stop when an assertion fails?

By using error_append, assert will return the original data.frame when there’s a failure, with a special attribute called assertr_errors that can be accessed later for all the information about failed assertions.

library(magrittr)

my_df %<>%
  assert(in_set(0,1), x, error_fun = error_append) %>%
  verify(x == 1, error_fun = error_append)
my_df
#   x
# 1 0
# 2 1
# 3 1
# 4 2
attr(my_df, 'assertr_errors')
# [[1]]
# Column 'x' violates assertion 'in_set(0, 1)' 1 time
#   index value
# 1     4     2
# 
# [[2]]
# verification [x == 1] failed! (2 failures)

(Ok, I cheated there, folks. I used verify, another function from assertr, and a bunch of magrittr pipes like %<>%.)

Enough with the toy examples

Ok, so here’s the code I wrote today. This started as a huge mess that I ended up turning into two functions. First, is_valid_fk provides a straightforward way to get TRUE or FALSE on whether or not all of your codes/ids exist in a lookup data.frame.

is_valid_fk <- function(data, key, values,
                        error_fun = error_logical,
                        success_fun = success_logical){

  assert_(data, in_set(values), key,
          error_fun = error_fun, success_fun = success_fun)

}

The first argument data is your data.frame, the second argument key is the foreign key column in data, and values are all valid values for key. Defaulting error_fun and success_fun to the *_logical helpers means a single TRUE or FALSE is the expected response.
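
As a quick check, reusing the toy my_df from above and treating c(0, 1) as the set of valid codes (a stand-in for a real lookup table’s key column):

is_valid_fk(my_df, key = 'x', values = c(0, 1))
# [1] FALSE
is_valid_fk(my_df, key = 'x', values = c(0, 1, 2))
# [1] TRUE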

But I don’t really want to do these one column at a time. I want to check whether all of the foreign keys in a table are good to go. I also don’t want a boolean; I want to get back all the errors in a usable format. So I wrote all_valid_fk.

Let’s take it one bit at a time.

all_valid_fk <- function(data, fk_list, id = 'code') {

  1. data is the data.frame we’re checking foreign keys in.
  2. fk_list is a list of data.frames. Each element is named for the key that it looks up; each data.frame contains the valid values for that key, named…
  3. id, the name of the column in each data.frame in the list fk_list that corresponds to the valid keys.

verify(data, do.call(has_all_names, as.list(names(fk_list))))

Right away, I want to know if my data has all the columns my fk_list says it should. I have to do some do.call magic because has_all_names wants something like has_all_names('this', 'that', 'the_other'), not has_all_names(c('this', 'that', 'the_other')).
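
Put differently, do.call splices a character vector into separate arguments; a minimal sketch, with a made-up one-row data.frame standing in for data:

dat <- data.frame(functions = '1001', objects = '100',
                  stringsAsFactors = FALSE)
nm <- c('functions', 'objects')
# do.call builds the same call as has_all_names('functions', 'objects')
verify(dat, do.call(has_all_names, as.list(nm)))
# passes, so dat comes back unchanged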

The next part is where the magic happens.

accumulated_errors <- map(names(fk_list),
                            ~ is_valid_fk(data,
                                          key = .x,
                                          values = fk_list[[.x]][[id]],
                                          error_fun = error_append,
                                          success_fun = success_continue)) %>%
                        map(attr, 'assertr_errors') %>%
                        reduce(append)

Using map, I am able to call is_valid_fk on each of the columns in data that have a corresponding lookup table in fk_list. The valid values are fk_list[[.x]][[id]], where .x is the name of the data.frame in fk_list (which corresponds to the name of the code we’re looking up in data and exists for sure, thanks to that verify call) and id is the name of the key column in that data.frame, as stated earlier. I’ve replaced error_fun and success_fun so that the code does not exit map as soon as there are any problems. Instead, the data is returned for each assertion, with the error attribute attached if one exists. 2 Immediately, map is called on the resulting list of data.frames to collect the assertr_errors, which are reduced using append into a flattened list.

If there are no errors accumulated, accumulated_errors is NULL, and the function exits early.

if(is.null(accumulated_errors)) return(list())

I could have stopped here and returned all the messages in accumulated_errors. But I don’t like all that text, I want something neater to work with later. The structure I decided on was a list of data.frames, with each element named for the column with the failed foreign key assertion and the contents being the index and value that failed the constraint.

By calling str on the data.frames returned by the assertions, I was able to see that the index and value tables printed in the failed assert messages are contained in error_df. So next I extract each of those data.frames into a single list.

reporter <- accumulated_errors %>%
            map('error_df') %>%
            map(~ map_df(.x, as.character)) # because factors suck

I’m almost done. I have no way of identifying which column created each of those error_df in reporter. So to name each element based on the column that failed the foreign key constraint, I have to extract data from the message attribute. Here’s what I came up with.

names(reporter) <- accumulated_errors %>%
                   map_chr('message') %>%
                   gsub("^Column \'([a-zA-Z]+)\' .*$", '\\1', x = .)
reporter

So let’s create some fake data and run all_valid_fk to see the results:

> df <- data.frame(functions = c('1001','1002', '3001', '3002'),
                   objects = c('100','102', '103', '139'),
                   actuals = c(10000, 2431, 809, 50000),
                   stringsAsFactors = FALSE)

> chart <- list(functions = data.frame(code = c('1001', '1002', '3001'),
                                       name = c('Foo', 'Bar', 'Baz'),
                                       stringsAsFactors = FALSE),
                objects =   data.frame(code = c('100', '102', '103'),
                                       name = c('Mom', 'Dad', 'Baby'),
                                       stringsAsFactors = FALSE))
> all_valid_fk(data = df, fk_list = chart, id = 'code')
$functions
# A tibble: 1 × 2
  index value
  <chr> <chr>
1     4  3002

$objects
# A tibble: 1 × 2
  index value
  <chr> <chr>
1     4   139

Beautiful!

And here’s all_valid_fk in one big chunk.

all_valid_fk <- function(data, fk_list, id = 'code') {
  verify(data, do.call(has_all_names, as.list(names(fk_list))))

  accumulated_errors <- map(names(fk_list),
                            ~ is_valid_fk(data,
                                          key = .x,
                                          values = fk_list[[.x]][[id]],
                                          error_fun = error_append,
                                          success_fun = success_continue)) %>%
                        map(attr, 'assertr_errors') %>%
                        reduce(append)

  if(is.null(accumulated_errors)) return(list())

  reporter <- accumulated_errors %>%
              map('error_df') %>%
              map(~ map_df(.x, as.character))

  names(reporter) <- accumulated_errors %>%
                     map_chr('message') %>%
                     gsub('Column \'(\\S*?)\'.*$', '\\1', x = .)
  reporter
}

My thanks to Jonathan Carroll, who was kind enough to read this post closely and actually try to run the code. As a result, I’ve fixed a couple of typos and now have an improved regex pattern above.


  1. I appear to have forgotten to build link post types into my Hugo blog, so the missing link from that post is here↩︎

  2. I am a little concerned about memory here. Eight assertions would mean, at least briefly, eight copies of the same data.frame copied here without the need for that actual data. There is probably a better way. ↩︎

March 29, 2017

One of my favorite things to eat in Baltimore.

March 23, 2017
March 12, 2017
March 2, 2017

Here’s a fun common task. I have a data set that has a bunch of codes like:

Name Abbr Code
Alabama AL 01
Alaska AK 02
Arizona AZ 04
Arkansas AR 05
California CA 06
Colorado CO 08
Connecticut CT 09
Delaware DE 10

All of your data is labeled with the code value. In this case, you want to do a join so that you can use the actual names because it’s 2017 and we’re not animals.

But what if your data, like the accounting data we deal with at Allovue, has lots of code fields? You probably have either one table that contains all of the lookups in “long” format, with a column that indicates which column in your data the code is for, like this:

code type name
01 fips Alabama
02 fips Alaska

Alternatively, you may have a lookup table per data element (so one called fips, one called account, one called function, etc).
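
If you start from the single long table, split gets you to the one-lookup-per-element shape; here’s a small sketch using the columns from above (the account row is made up):

long_lookup <- data.frame(code = c('01', '02', '200'),
                          type = c('fips', 'fips', 'account'),
                          name = c('Alabama', 'Alaska', 'Salaries'),
                          stringsAsFactors = FALSE)
lookups <- split(long_lookup, long_lookup$type)
# lookups$fips and lookups$account are now separate lookup data.frames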

I bet most folks do the following in this scenario:

account <- left_join(account, account_lookup)
account <- left_join(account, fips)

## Maybe this instead ##
account %<>%
  left_join(account_lookup) %>%
  left_join(fips)

I want to encourage you to do this a little differently using purrr. Here’s some annotated code that uses reduce_right to make magic.

# Load a directory of .csv files that has each of the lookup tables
lookups <- map(dir('data/lookups', full.names = TRUE),
               read.csv, stringsAsFactors = FALSE)
# Alternatively if you have a single lookup table with code_type as your
# data attribute you're looking up
# lookups <- split(lookups, lookups$code_type)
lookups$real_data <- read.csv('data/real_data.csv', 
                              stringsAsFactors = FALSE)
real_data <- reduce_right(lookups, left_join)

Boom, now you went from data with attributes like funds_code, function_code, state_code to data that also has funds_name, function_name, state_name 1. What’s great is that this same code can be reused no matter how many fields require a hookup. I’m often dealing with accounting data where the accounts are defined by a different number of data fields, but my code doesn’t care at all.
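
If you’d rather not append real_data to the lookups list, the same fold can be written with plain reduce and an explicit .init; a minimal sketch, assuming lookups here holds only the lookup tables (purrr and dplyr loaded, as above):

# same idea: fold left_join over the lookups, starting from real_data
real_data <- read.csv('data/real_data.csv', stringsAsFactors = FALSE)
real_data <- reduce(lookups, left_join, .init = real_data)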


  1. My recommendation is to use consistent naming conventions like _code and _name so that knowing how to do the lookups is really straightforward. This is not unlike the convention with Microsoft SQL where the primary key of a table is named Id and a foreign key to that table is named TableNameId. Anything that helps you figure out how to put things together without thinking is worth it. ↩︎

March 1, 2017

One of my goals for 2017 is to contribute more to the R open source community. At the beginning of last year, I spent a little time helping to refactor rio. It was one of the more rewarding things I did in all of 2016. It wasn’t a ton of work, and I feel like I gained a lot of confidence in writing R packages and using S3 methods. I wrote code that R users download and use thousands of times a month.

I have been on the lookout for a Javascript powered interactive charting library since ggvis was announced in 2014. But ggvis seems to have stalled out in favor of other projects (for now) and the evolution of rCharts into htmlwidgets left me feeling like there were far too many options and no clear choices.

What I was looking for was a plotting library to make clean, attractive graphics with tool tips that came with clear documentation and virtually no knowledge of Javascript required. Frankly, all of the htmlwidgets stuff was very intimidating. From my vantage point skimming blog posts and watching stuff come by on Twitter, htmlwidgets-based projects all felt very much directed at Javascript polyglots.

Vega and Vega-Lite had a lot of the qualities I sought in a plotting library. Reading and writing JSON is very accessible compared to learning Javascript, especially with R’s excellent translation from lists to JSON. And although I know almost no Javascript, I found that both Vega and Vega-Lite have easy-to-understand documentation that felt a lot like building grammar of graphics 1 plots.

So I decided to take the plunge– there was a vegalite package and the examples didn’t look so bad. It was time to use my first htmlwidgets package.

Things went great. I had some simple data and I wanted to make a bar chart. I wrote:

vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar()

A bar chart was made! But then I wanted to use the font Lato, which is what we use at Allovue. No worries, Vega-Lite has a property called titleFont for axes. So I went to do:

vegalite() %>%
add_data(my_df) %>%
encode_x('schools', type = 'nominal') %>%
encode_y('per_pupil', type = 'quantitative') %>%
mark_bar() %>%
axis_x(titleFont = 'Lato')

Bummer. It didn’t work. I almost stopped there, experiment over. But then I remembered my goal and I thought, maybe I need to learn to contribute to a package that is an htmlwidget and not simply use an htmlwidget-based package. I should at least look at the code.

What I found surprised me. Under the hood, all the R package does is build up lists. It makes so much sense– pass JSON to Javascript to process and do what’s needed.
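
That division of labor is easy to play with directly via jsonlite; here’s a minimal sketch, where the spec is a simplified, made-up fragment rather than anything the package actually emits:

library(jsonlite)

# an R list that mirrors a (hypothetical, simplified) slice of a Vega-Lite spec
spec <- list(
  mark = "bar",
  encoding = list(
    x = list(field = "schools", type = "nominal"),
    y = list(field = "per_pupil", type = "quantitative")
  )
)
toJSON(spec, auto_unbox = TRUE, pretty = TRUE)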

So it turned out, vegalite for R was a bit behind the current version of Vega-Lite and didn’t have the titleFont property yet. And with that, I made my first commit. All I had to do was update the function definition and add the new arguments to the axis data like so:

if (!is.null(titleFont))    vl$x$encoding[[chnl]]$axis$titleFont <- titleFont

But why stop there? I wanted to update all of vegalite to use the newest available arguments. Doing so looked like a huge pain, though. The original package author made these great functions like axis_x and axis_y. They both had the same arguments; the only difference was that the “channel” was preset as x or y based on which function was called. The problem was that all of the arguments, all of the assignments, and all of the documentation had to be copied twice. It was worse with encode and scale, which had many, many functions that are similar or identical in their “signature”. No wonder the package was missing so many Vega-Lite features– they were a total pain to add.

So as a final step, I decided I would do a light refactor across the whole package. In each of the core functions, like encode and axis, I would write a single generic function like encode_vl() that would hold all of the possible arguments for the encoding portion of Vega-Lite. Then the specific functions like encode_x could become wrapper functions that internally call encode_vl like so:

encode_x <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "x", ...)
  vl
}

encode_y <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "y", ...)
  vl
}

encode_color <- function(vl, ...) {
  vl <- encode_vl(vl, chnl = "color", ...)
  vl
}

Now, in order to update the documentation and the arguments for encoding, I just have to update the encode_vl function. It’s a really nice demonstration, in my opinion, of the power of R’s ... syntax. All of the wrapper functions can just pass whatever additional arguments the caller wants to encode_vl without having to explicitly list them each time.

This greatly reduced duplication in the code and made it far easier to update vegalite to the newest version of Vega-Lite, which I also decided to do.

Now Vega-Lite itself is embarking on a 2.0 release that I have a feeling will have some pretty big changes in store. I’m not sure if I’ll be the one to update vegalite– in the end, I think that Vega-Lite is too simple for the visualizations I need to do– but I am certain whoever does the update will have a much easier go of it now than they would have just over a month ago.

Thanks to Bob Rudis for making vegalite and giving me free range after a couple of commits to go hog-wild on his package!


  1. The gg in ggplot2↩︎

February 28, 2017

Just a night on the town in Portland.

February 27, 2017

You can check my Goodreads profile. I love science fiction and fantasy. And I know that by 2017 everyone has already observed the dominance of “geek culture”, with the ubiquity of Disney properties from Marvel and now Star Wars. Hell, Suicide Squad won a goddamn Oscar.

But I never felt like SFF was all that mainstream. SyFy might have made (and renewed) a TV series based on The Magicians, but I still feel like the disaffected entitled shit that held onto his love of genre fiction too long when I crawl into bed and hide in speculative fiction (thank you Quentin, for so completely capturing what a shit I was at 14).

Yesterday, I was confronted with the reality of SFF going mainstream at Powell’s City of Books. I was fully unprepared to see the contents of their Best Selling Fiction shelf.

By my count, at least 16 of the top 42 are SFF. The Name of the Wind, The Left Hand of Darkness, The Fifth Season, 2312, and Uprooted are some of the best books I’ve read in the last four or five years. To think of these books as best sellers when they don’t have a TV show coming out (like American Gods, The Handmaid’s Tale, The Man in the High Castle, and The Magicians) and aren’t assigned in high school classrooms (1984, Slaughterhouse-Five) is just shocking. In my mind, these aren’t best sellers, they’re tiny nods between myself and other quiet bookshoppers that we are kin.

I am not sad though. I am thrilled. I want to live in a world where I can just assume acquaintances are reading The Fifth Season and Uprooted.

This is a part of the march of progress that seemed certain to triumph a few months ago.

February 26, 2017

Has a bookstore ever known its sci-fi/fantasy audience this well?

February 25, 2017
February 24, 2017

Grilling in the dark because our courtyard lights are broken.

February 19, 2017
February 4, 2017

Going to grandma and grandpa’s house for rehab.

Star Wars was on when we got to grandma and grandpa’s!

February 1, 2017