February 5, 2018
February 4, 2018

I am shocked at how slow a fairly straightforward DELETE is on PostgreSQL. Stupid nested data with 1:M relationship. This seems absurd:

DELETE 25560
Time: 262387.789 ms

(cascades to 76680 records in another table as well in this example)

February 3, 2018

I believe in owning my own space on the web. I have had some form of a blog since LiveJournal, but frequently burn it down to the ground. For a while I’ve maintained a static site, first using Pelican and now Hugo/blogdown. I’ve never been happy with my post frequency, yet I now have over 60,000 tweets. After months of waffling and considering launching my own micro blog using Hugo, I just decided I’d rather pay @manton and get it up and running. If microblogging is the format that keeps me writing, it’s time to not just embrace it, but to support the kind of microblogging that I believe in. Off to figure out how to point micro.json.blog here.

Missed my GoodReads challenge 2 years in a row. So I started this year by breaking one of my rules and doing a couple of rereads— The Subtle Knife & The Amber Spyglass. Hadn’t read either since they were published. Now I’m on my sixth book this year. 💪🏻📚

January 24, 2018

I have some text, but I want the content of that text to be dynamic based on data. This is a case for string interpolation. Lots of languages have the ability to write something like

pet = "dog"
puts "This is my {#pet}"
pet = "dog"
print(f"This is my {pet}")

There have been ways to do this in R, but I’ve mostly hated them until glue came along. Using glue in R should look really familiar now:

pet <- "dog"
glue("This is my {pet}")

Awesome! Now I have a way to make text bend to my bidding using data. But this is pretty simple, and we could have just used something like paste("This is my", pet) and been done with it.

Let me provide a little motivation in the form of data.frames, glue_data, and some purrr.

Pretend we have a field in a database called notes. I want to set the notes for each entity to follow the same pattern, but use other data to fill in the blanks. Maybe something like this:

notes <- "This item price is valid through {end_date} and will then increase {price_change} to {new_price}."

This is a terribly contrived example, but we can imagine displaying this note to someone with different content for each item. Now in most scenarios, the right thing to do for an application is to produce this content dynamically based on what’s in the database, but let’s pretend no one looked far enough ahead to store this data, or that notes can serve lots of different purposes using different data. So there is no place for the application to find end_date, price_change, or new_price in its database. Instead, this was something prepared by sales in Excel yesterday, and they want these notes added to all items to warn their customers.

Here’s how to take a table that has item_id, end_date, price_change, and new_price as columns and turn it into a table with item_id and notes as columns, with a properly formatted note for each item to be updated in a database.

library(glue)
library(purrr)
library(dplyr) # for rename() and %>%

item_notes <- data.frame(
  item_id = seq_len(10),
  end_date = c(rep(as.Date('2018-03-01', format = '%Y-%m-%d'), 5),
               rep(as.Date('2018-03-05', format = '%Y-%m-%d'), 3),
               rep(as.Date('2018-03-09', format = '%Y-%m-%d'), 2)),
  price_change = sample(x = seq_len(5), replace = TRUE, size = 10),
  new_price = sample(x = 10:20, replace = TRUE, size = 10)
)

template <- "This item price is valid through {end_date} and will then increase {price_change} to {new_price}."

map_chr(split(item_notes, item_notes$item_id),
        glue_data,
        template) %>%
  stack() %>%
  rename(item_id = ind,
         notes = values)

What’s going on here? First, I want to apply my glue technique to rows of a data.frame, so I split the data into a list using item_id as the identifier. That’s because at the end of all this I want to preserve that id to match back up in a database. 1 The function glue_data works like glue, but it accepts things that are “listish” as its first argument (like data.frames and named lists). So with a handy map over my newly created list of “listish” data, I create a named character vector with the text I wanted to generate. I then use a base R function that’s new to me, stack, which takes that and makes each element a row in a data.frame, with ind holding the name of the element and values holding the value.
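
If stack is new to you too, here’s the basic idea with a throwaway named list (just an illustration, not part of the pipeline above):

stack(list(a = "first note", b = "second note"))
# returns a two-row data.frame with a values column ("first note", "second note")
# and an ind column holding the names ("a", "b")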

Now I’ve got a nice data.frame, ready to be joined with any table that has item_id so it can have the attached note!
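
For example, if the result of that pipeline were saved as generated_notes and items were a hypothetical table with an integer item_id column, attaching the notes is just a join:

notes_df <- generated_notes %>%
  mutate(item_id = as.integer(as.character(item_id)))  # stack() leaves item_id as a factor

items_with_notes <- left_join(items, notes_df, by = "item_id")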


  1. You can split on row.names if you don’t have a similar identifier and just want to go from data.frame to a list of your rows. ↩︎

January 2, 2018

I have been using ggplot2 for 7 years, I think. In all that time, I’ve been frustrated that I can never figure out what order to put my color values in for scale_*_manual. Not only does the order of the mapping seem random to me, I know that sometimes if I change something about how I’m treating the data, the order switches up.

Countless hours could have been saved if I knew that this one, in hindsight, obvious thing was possible.

Whenever using scale_*_manual, you can map the aesthetic to an arbitrary character label inside aes() and then name the values in the scale_ call like so:

geom_blah(aes(color = 'good')) +
geom_blah(aes(color = 'bad')) +
scale_blah_manual(values = c(good = 'green', bad = 'red'))

Obviously this is a toy example, but holy game changer.
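
Here’s a slightly less toy version, with made-up data, that actually runs:

library(ggplot2)

# made-up data: each point is labeled "good" or "bad"
df <- data.frame(
  day = 1:6,
  value = c(3, 5, 2, 8, 7, 1),
  outcome = c("good", "bad", "good", "good", "bad", "bad")
)

ggplot(df, aes(x = day, y = value, color = outcome)) +
  geom_point(size = 3) +
  # named values mean I never have to guess what order the levels come in
  scale_color_manual(values = c(good = "green", bad = "red"))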

December 30, 2017

Looking back on 2017, there were three major trends in my R code: the end of S4, writing directly to SQL databases, and purrr everywhere.

The End of S4

The first package I ever wrote extensively used S4 classes. I wanted to have the security of things like setValidity. I liked the idea of calling new as it felt more like class systems I was familiar with from that one semester of Java in college. S4 felt more grown up than S3, more like it was utilizing the advantages of object oriented programming, and less exotic than R6, which in 2014 felt riskier to build with and teach future employees. Using S4 was a mistake from day one and never led to any advantages in the code I wrote.

So this year, I rewrote that original package. It’s internal (and a core function) at my job so I can’t share too much, but this was a long time coming. Not only did I clean up a lot of code that was just plain bad (in the way all old code is), but I got rid of S4 in favor of S3 or more functional code wherever possible. Our test coverage is far more complete, the code is far easier to extend without duplication, and it looks far more idiomatic to the standard non-BioConductor R user.

What’s the lesson learned here? From a technical perspective, it would be to avoid premature optimization and, of course, that everyone can and wants to throw out old code they revisit with greater knowledge and context. But I know those things. What drove me to make the wrong decision here was purely imposter syndrome. I was writing code that had to run unattended on a regular basis as part of a product in a new job. I didn’t feel up to the task, so I felt working with a new, complex, scary part of R that promised some notion of “safety” would mean I really knew what I was doing. So my takeaway from walking away from S4 is this: start small, build what you know, have confidence you can solve problems one at a time, and trust yourself.

Directly Writing SQL

I use SQL far more than R, but almost entirely as a consumer (e.g. SELECT only). I’ve almost always directly used SQL for my queries into other people’s data, but rarely ventured into the world of INSERT or UPDATE directly, preferring to use interfaces like dbWriteTable. This gets back to imposter syndrome– there’s so little damage that can be done with a SELECT statement, but writing into databases I don’t control means taking on risk and responsibility.

This year I said fuck it– there’s a whole lot of work and complexity going on that’s entirely related to me not wanting to write INSERT INTO statements, and PostgreSQL has the amazing ON CONFLICT...-based “upserts” now. So I started to write a lot of queries, some of them pretty complex 1. R is a great wrapper language, and its database story is getting even better with the new DBI, odbc, and RPostgres packages. Although its native table writing support is a little weak, there’s no problem at all just using dbSendStatement with complex queries. I’ve fallen into a pattern I really like of writing temporary tables (with dplyr::copy_to because it’s clean in a pipeline) and then executing complex SQL with dbSendStatement. In the future, I might be inclined to turn these into database functions, but either way this change has been great. I feel more confident than ever working with databases and R (my two favorite places to be), and I have been able to simplify a whole lot of code that involved passing around text files (and boy do I hate the type inference and other madness that can happen with CSVs. Oy.).
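
The basic shape of that pattern looks something like this (the connection details, table, and column names here are all made up):

library(DBI)
library(dplyr)

con <- dbConnect(RPostgres::Postgres(), dbname = "mydb")  # made-up connection

# a hypothetical local data.frame to push into the database
price_updates <- data.frame(item_id = 1:3, price = c(10, 12, 15))

# stage it in a temporary table
copy_to(con, price_updates, name = "tmp_price_updates", temporary = TRUE)

# then run the upsert with plain SQL (assumes a prices table keyed on item_id)
res <- dbSendStatement(con, "
  INSERT INTO prices (item_id, price)
  SELECT item_id, price FROM tmp_price_updates
  ON CONFLICT (item_id) DO UPDATE SET price = EXCLUDED.price;
")
dbClearResult(res)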

purrr

This is the year that purrr not only clicked, but became my preferred way to write code. Where there was apply, now there was purrr. Everything started to look like a list. I’m still only scratching the surface here, but I love code like this:

library(dplyr)
library(purrr)
library(magrittr)   # for the %$% exposition pipe
library(rmarkdown)

locations %>%
  filter(code %in% enrollment$locations) %$%
  code %>%
  walk(function(x) render(input = 'schprofiles.Rmd',
                          html_document(theme = NULL,
                                        template = NULL,
                                        self_contained = FALSE,
                                        css = 'static/styles.css',
                                        lib_dir = 'cache/demo/output/static/',
                                        includes = includes('fonts.html')),
                          params = list(school_code = x),
                          output_file = paste0(x,'.html'),
                          output_dir = "cache/demo/output/"))

It’s a simple way to run through all of the locations (a data.frame with columns code and name) and render an HTML-based profile of each school (defined by having student enrollment). walk is beautiful, and so is purrr. I mean, who doesn’t need to do map(., mutate_if, is.numeric, as.character) 10 times a day?

2018 R Goals

One thing that’s bittersweet is that 2017 is probably the last year in a long time that writing code is the main thing my day-to-day job is about. With increased responsibility and the growth of my employees, I find myself reviewing code a lot more than writing it, and sometimes not even that. With that in mind, I have a few goals for 2018 that I hope will keep the part of me that loves R engaged.

First, I want to start writing command line utilities using R. I know almost nothing beyond Rscript -e or ./script.sh when it comes to writing a CLI. But there are all kinds of tasks I do every day that could be written as small command line scripts. Plus, my favorite part of package authoring is writing interfaces for other people to use. How do I expect someone to want to use R and reason about a problem I’m helping to solve? It’s no wonder that I work on product every day with this interest. So I figure one way to keep engaged in R is to learn how to design command line utilities in R and get good at it. Rather than write R code purely intended to be called and used from R, my R code is going to get an interface this year.

Like every year, I’d like to keep up with this blog. I never do, but this year I had a lot of encouraging signs. I actually got considerable attention for every R-related post (high hundreds of views), so I think it’s time to lean into that. I’m hoping to write one R related post each week. I think the focus will help me have some chance of pulling this off. Since I also want to keep my R chops alive while I move further and further away from day to day programming responsibilities, it should be a two birds with one stone scenario. One major thing I haven’t decided– do I want to submit to r-bloggers? I’m sure it’d be a huge source of traffic, but I find it frustrating to have to click through from my RSS reader of choice when finding things there.

Lastly, I’d like to start to understand the internals of a core package I use every day. I haven’t decided what that’ll be. Maybe it’ll be something really fundamental like dplyr, DBI, or ggplot2. Maybe it’ll be something “simpler”. But I use a lot more R code than I read. And one thing I’ve learned every time I’ve forced myself to dig in is that I understand more R than I thought and also that reading code is one of the best ways to learn more. I want to do at least one deep study that advances my sense of self-R-worth. Maybe I’ll even have to take the time to learn a little C++ and understand how Rcpp is being used to change the R world.

Special Thanks

The #rstats world on Twitter has been the only reason I can get on that service anymore. It’s a great and positive place where I learn a ton, and I really appreciate feeling like there is a family of nerds out there talking about stuff that I feel like no one should care about. My tweets are mostly stupid musings that come to me and retweets of enraging political stuff in the dumpster fire that is Trump’s America, so I’m always surprised and appreciative that anyone follows me. It’s so refreshing to get away from that and just read #rstats. So thank you for inspiring me and teaching me and being a fun place to be.


  1. I let out quite the “fuck yea!” when I got that upsert query with two common table expressions and two joins (one of them lateral) to work. ↩︎

December 16, 2017
November 13, 2017

Hard to complain about the traveling I do when I get to see this outside of the window. Off to Miami.

November 3, 2017
October 1, 2017

Feels like fall, but it looks like summer.

September 12, 2017

Killer title at front desk of a new @allovuebalance district

September 9, 2017

Pork tenderloin is inexpensive, low fat, and high protein. Great for watching weight. The lack of fat can make them tricky to cook, but sous vide at 130°F makes it easy to make amazing, tender, and juicy pork a week’s worth at a time.

August 4, 2017

My dogs have an Instagram now. #furbaby @gracie.and.brandy

June 11, 2017
May 31, 2017

Still my favorite spot, especially when I’m feeling down and out.

I have not yet spent the time to figure out how to generate a JSON feed in Hugo. But I have built an R package to play with JSON feeds. It’s called jsonfeedr, and it’s silly simple.

Maybe I’ll extend this in the future. I hope people will submit PRs to expand it. For now, I was inspired by all the talk about why JSON feed even exists. Working with JSON is fun and easy. Working with XML is not.

Anyway, I figured the guy who registered json.blog should have a package out there working with JSON.

May 24, 2017

Finishes for the new house. Colors are a bit off in photos. In my defense, this is after like 6hrs in the home gallery.

May 17, 2017

Non-standard evaluation is one of R’s best features, and also one of its most perplexing. Recently I have been making good use of wrapr::let to allow me to write reusable functions without a lot of assumptions about my data. For example, let’s say I always want to group_by schools when adding up dollars spent, but sometimes my data calls what is conceptually a school schools, school, location, cost_center, Loc.Name, etc. What I have been doing is storing a set of parameters in a list that maps the actual names in my data to consistent names I want to use in my code. Sometimes that comes from using params in an Rmd file. So the top of my file may say something like:

params:
    school: "locations"
    amount: "dollars"
    enrollment: n

In my code, I may want to write a chain like

create_per_pupil <- . %>%
                    group_by(school) %>%
                    summarize(per_pupil = sum(amount) / n)
pp <- district_data %>%
      create_per_pupil

The only problem is that school isn’t always school. In this toy case, you could use group_by_(params$school), but it’s pretty easy to run into limitations with the _ functions in dplyr when writing functions.

Using wrapr::let, I can easily use the code above:

let(alias = params, {
    create_per_pupil <- . %>%
                        group_by(school) %>%
                        summarize(per_pupil = sum(amount)/n)
})

pp <- district_data %>%
      create_per_pupil

The core of wrapr::let is really scary.

body <- strexpr
for (ni in names(alias)) {
    value <- as.character(alias[[ni]])
    if (ni != value) {
        pattern <- paste0("\\b", ni, "\\b")
        body <- gsub(pattern, value, body)
    }
}
parse(text = body)

Basically, let holds onto the code block contained within it, iterates over the list of key-value pairs that are provided, and then runs a gsub on word boundaries to replace all instances of the list names with their values. Yikes.

This works, I use it all over, but I have never felt confident about it.

The New World of tidyeval

The release of dplyr 0.6 along with tidyeval brings with it a ton of features that make programming over dplyr functions far better supported. I am going to read this page by Hadley Wickham at least 100 times. There are all kinds of new goodies (!!! looks amazing).

So how would I re-write the chain above sans let?

create_per_pupil <- . %>%
                    group_by(!!sym(school)) %>%
                    summarize(per_pupil = sum(amount)/n)

If I understand tidyeval, then this is what’s going on.

  • sym evaluates school and makes the result a symbol
  • and !! says, roughly “evaluate that symbol now”.

This way, with params$school having the value “school_name”, sym(school) evaluates that to “school_name” and then makes it an unquoted symbol, school_name. Then !! tells R “You can evaluate this next thing in place as it is.”
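
Here is a tiny self-contained version of that idea, with made-up data and names, just to see it work end to end:

library(dplyr)
library(rlang)

params <- list(school = "locations", amount = "dollars")

district_data <- tibble(
  locations = c("A", "A", "B"),
  dollars   = c(100, 200, 300),
  n         = c(10, 10, 20)   # enrollment, constant within a school here
)

district_data %>%
  group_by(!!sym(params$school)) %>%
  summarize(per_pupil = sum(!!sym(params$amount)) / first(n))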

I originally wrote this post trying to understand enquo, but I never got it to work right and it makes no sense to me yet. What’s great is that rlang::sym and rlang::syms with !! and !!! respectively work really well so far. There is definitely less flexibility– with the full-on quosure stuff you can have very complex evaluations. But I’m mostly worried about having very generic names for my data, so sym and syms seem to work great.

May 6, 2017
April 28, 2017

I think this is my new favorite tree.

April 23, 2017

We’re not even done brushing her. @westminsterfox

April 11, 2017

Maxwell House purchased literally three decades of advertising with these free Haggadahs.