I’m afraid of a world without brave, outspoken Holocaust survivors to hold our collective consciousness to account. Brutality, mass violence, and oppression are not distant possibilities; they remain reality.
This shit is toxic. The casual abuse women face in tech is deeply disturbing. I can’t even imagine this kind of behavior.
I am shocked at how slow a fairly straightforward DELETE is on PostgreSQL. Stupid nested data with a 1:M relationship. This seems absurd:
DELETE 25560
Time: 262387.789 ms
(cascades to 76680 records in another table as well in this example)
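For what it's worth, the usual culprit when a cascading DELETE is this slow is a missing index on the child table's foreign-key column: PostgreSQL has to scan the child table once per deleted parent row. A sketch of the fix, with hypothetical table and column names (`child_table`, `parent_id`):

```r
# Hypothetical fix sketch: index the referencing column so the cascade
# can use an index scan instead of repeated sequential scans.
# (Table and column names are made up -- adjust to the real schema.)
index_sql <- "CREATE INDEX IF NOT EXISTS child_parent_id_idx
              ON child_table (parent_id);"

# con <- DBI::dbConnect(RPostgres::Postgres(), dbname = "mydb")
# DBI::dbExecute(con, index_sql)  # then re-run the DELETE and compare timing
```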
I believe in owning my own space on the web. I have had some form of a blog since LiveJournal, but frequently burn it down to the ground. For a while I’ve maintained a static site, first using Pelican and now Hugo/blogdown. I’ve never been happy with my post frequency, yet I now have over 60,000 tweets. After months of waffling and considering launching my own micro blog using Hugo, I just decided I’d rather pay @manton and get it up and running. If microblogging is the format that keeps me writing, it’s time to not just embrace it, but to support the kind of microblogging that I believe in. Off to figure out how to point micro.json.blog here.
Missed my GoodReads challenge 2 years in a row. So I started this year by breaking one of my rules and doing a couple of rereads— The Subtle Knife & The Amber Spyglass. Hadn’t read either since they were published. Now I’m on my sixth book this year. 💪🏻📚
I have some text, but I want the content of that text to be dynamic based on data. This is a case for string interpolation. Lots of languages have the ability to write something like
pet = "dog"
puts "This is my #{pet}"   # Ruby

pet = "dog"
print(f"This is my {pet}") # Python
There have been ways to do this in R, but I’ve mostly hated them until glue came along. Using glue in R should look really familiar now:
pet <- "dog"
glue("This is my {pet}")
Awesome! Now I have a way to make text bend to my bidding using data. But this is pretty simple, and we could have just used something like paste("This is my", pet) and been done with it.
Let me provide a little motivation in the form of data.frames, glue_data, and some purrr.
Pretend we have a field in a database called notes. I want to set the notes for each entity to follow the same pattern, but use other data to fill in the blanks. Like maybe something like this:
notes <- "This item price is valid through {end_date} and will then increase {price_change} to {new_price}."
This is a terribly contrived example, but we can imagine displaying this note to someone with different content for each item. Now in most scenarios, the right thing to do for an application is to produce this content dynamically based on what’s in the database, but let’s pretend no one looked far enough ahead to store this data, or that notes can serve lots of different purposes using different data. So there is no place for the application to find end_date, price_change, or new_price in its database. Instead, this was something prepared by sales in Excel yesterday, and they want these notes added to all items to warn their customers.
Here’s how to take a table that has item_id, end_date, price_change, and new_price as columns and turn it into a table with item_id and notes as columns, with your properly formatted note for each item to be updated in a database.
library(glue)
library(purrr)
library(dplyr) # for rename() and the pipe used below
item_notes <- data.frame(
  item_id = seq_len(10),
  end_date = c(rep(as.Date('2018-03-01', format = '%Y-%m-%d'), 5),
               rep(as.Date('2018-03-05', format = '%Y-%m-%d'), 3),
               rep(as.Date('2018-03-09', format = '%Y-%m-%d'), 2)),
  price_change = sample(x = seq_len(5), replace = TRUE, size = 10),
  new_price = sample(x = 10:20, replace = TRUE, size = 10)
)
template <- "This item price is valid through {end_date} and will then increase {price_change} to {new_price}."
map_chr(split(item_notes, item_notes$item_id),
        glue_data,
        template) %>%
  stack() %>%
  rename(item_id = ind,
         notes = values)
What’s going on here? First, I want to apply my glue technique to rows of a data.frame, so I split the data into a list using item_id as the identifier. That’s because at the end of all this I want to preserve that id to match back up in a database.1 The function glue_data works like glue, but it accepts things that are “listish” as its first argument (like data.frames and named lists). So with a handy map over my newly created list of “listish” data, I create a named list with the text I wanted to generate. I then use a base R function that’s new to me, stack, which takes a list and makes each element a row in a data.frame, with ind as the name of the list element and values as the value.

Now I’ve got a nice data.frame, ready to be joined with any table that has item_id so it can have the attached note!
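If stack is new to you too, its ind/values behavior is easy to see on a tiny named list (the names and values here are just for illustration):

```r
# stack() turns a named list into a two-column data.frame:
# `values` holds the elements, `ind` holds the names (as a factor).
notes_list <- list(a = "note for item a", b = "note for item b")
stacked <- stack(notes_list)
names(stacked)  # "values" "ind"
```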
- You can split on row.names if you don’t have a similar identifier and just want to go from a data.frame to a list of your rows. ↩︎
Leftovers
I have been using ggplot2 for 7 years, I think. In all that time, I’ve been frustrated that I can never figure out what order to put my color values in for scale_*_manual. Not only is the order mapping seemingly random to me, I know that sometimes if I change something about how I’m treating the data, the order switches up.
Countless hours could have been saved if I knew that this one, in hindsight, obvious thing was possible.
Whenever using scale_*_manual, you can directly reference a color using a character vector and then name your values in the scale_ call like so:
geom_blah(aes(color = 'good')) +
geom_blah(aes(color = 'bad')) +
scale_blah_manual(values = c(good = 'green', bad = 'red'))
Obviously this is a toy example, but holy game changer.
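To make that toy example concrete, here’s a minimal runnable version (the data and color choices are invented for illustration):

```r
# Named values mean the color mapping no longer depends on the order
# of factor levels in the data.
library(ggplot2)

df <- data.frame(x = 1:4, y = c(2, 5, 3, 6),
                 status = c("good", "bad", "bad", "good"))

p <- ggplot(df, aes(x, y, color = status)) +
  geom_point(size = 3) +
  scale_color_manual(values = c(good = "forestgreen", bad = "firebrick"))
```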
Looking back on 2017, there were three major trends in my R code: the end of S4, directly writing SQL to databases, and purrr everywhere.
The End of S4
The first package I ever wrote extensively used S4 classes. I wanted to have the security of things like setValidity. I liked the idea of calling new, as it felt more like class systems I was familiar with from that one semester of Java in college. S4 felt more grown up than S3, more like it was utilizing the advantages of object-oriented programming, and less exotic than R6, which in 2014 felt riskier to build with and teach future employees. Using S4 was a mistake from day one and never led to any advantages in the code I wrote.
So this year, I rewrote that original package. It’s internal (and a core function) at my job so I can’t share too much, but this was a long time coming. Not only did I clean up a lot of code that was just plain bad (in the way all old code is), but I got rid of S4 in favor of S3 or more functional code wherever possible. Our test coverage is far more complete, the code is far easier to extend without duplication, and it looks far more idiomatic to the standard non-BioConductor R user.
What’s the lesson learned here? From a technical perspective, it would be to avoid premature optimization and, of course, that everyone can and wants to throw out old code they revisit with greater knowledge and context. But I know those things. What drove me to make the wrong decision here was purely imposter syndrome. I was writing code that had to be run unattended on a regular basis as a part of a product in a new job. I didn’t feel up to the task, so I felt working with a new, complex, scary part of R that promised some notion of “safety” would mean I really knew what I was doing. So my takeaway from walking away from S4 is this: start small, build what you know, have confidence you can solve problems one at a time, and trust yourself.
Directly Writing SQL
I use SQL far more than R, but almost entirely as a consumer (e.g. SELECT only). I’ve almost always directly used SQL for my queries into other people’s data, but rarely ventured into the world of INSERT or UPDATE directly, preferring to use interfaces like dbWriteTable. This gets back to imposter syndrome– there’s so little damage that can be done with a SELECT statement, but writing into databases I don’t control means taking on risk and responsibility.
This year I said fuck it– there was a whole lot of work and complexity going on that was entirely related to me not wanting to write INSERT INTO statements, and PostgreSQL has amazing ON CONFLICT-based “upserts” now. So I started to write a lot of queries, some of them pretty complex.1 R is a great wrapper language, and its database story is getting even better with the new DBI, odbc, and RPostgres packages. Although its native table-writing support is a little weak, there’s no problem at all just using dbSendStatement with complex queries. I’ve fallen into a pattern I really like of writing temporary tables (with dplyr::copy_to because it’s clean in a pipeline) and then executing complex SQL with dbSendStatement. In the future, I might be inclined to make these database functions, but either way this change has been great. I feel more confident than ever working with databases and R (my two favorite places to be), and I have been able to simplify a whole lot of code that involved passing around text files (and boy do I hate the type inference and other madness that can happen with CSVs. Oy.).
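The temp-table-then-upsert pattern looks roughly like this; the connection, the items table, and its columns are hypothetical stand-ins, not code from my actual project:

```r
# Sketch: stage rows in a temp table, then upsert into the real table.
# `con` and all table/column names are hypothetical.
upsert_sql <- "
  INSERT INTO items (item_id, notes)
  SELECT item_id, notes FROM tmp_items
  ON CONFLICT (item_id)
  DO UPDATE SET notes = EXCLUDED.notes;
"

# dplyr::copy_to(con, item_notes, name = "tmp_items", temporary = TRUE)
# res <- DBI::dbSendStatement(con, upsert_sql)
# DBI::dbClearResult(res)
```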
purrr
This is the year that purrr not only clicked, but became my preferred way to write code. Where there was apply, now there was purrr. Everything started to look like a list. I’m still only scratching the surface here, but I love code like this:
library(dplyr)     # filter()
library(magrittr)  # the %$% exposition pipe
library(purrr)     # walk()
library(rmarkdown) # render(), html_document(), includes()

locations %>%
filter(code %in% enrollment$locations) %$%
code %>%
walk(function(x) render(input = 'schprofiles.Rmd',
html_document(theme = NULL,
template = NULL,
self_contained = FALSE,
css = 'static/styles.css',
lib_dir = 'cache/demo/output/static/',
includes = includes('fonts.html')),
params = list(school_code = x),
output_file = paste0(x,'.html'),
output_dir = "cache/demo/output/"))
It’s a simple way to run through all of the locations (a data.frame with columns code and name) and render an HTML-based profile of each school (defined by having student enrollment). walk is beautiful, and so is purrr. I mean, who doesn’t need to do map(., mutate_if, is.numeric, as.character) 10 times a day?
2018 R Goals
One thing that’s bittersweet is that 2017 is probably the last year in a long time that writing code will be the main thing my everyday job is about. With increased responsibility and the growth of my employees, I find myself reviewing code a lot more than writing it, and sometimes not even that. With that in mind, I have a few goals for 2018 that I hope will keep the part of me that loves R engaged.
First, I want to start writing command line utilities using R. I know almost nothing beyond Rscript -e or ./script.sh when it comes to writing a CLI. But there are all kinds of tasks I do every day that could be written as small command line scripts. Plus, my favorite part of package authoring is writing interfaces for other people to use. How do I expect someone to want to use R and reason about a problem I’m helping to solve? It’s no wonder that I work on product every day with this interest. So I figure one way to keep engaged in R is to learn how to design command line utilities in R and get good at it. Rather than write R code purely intended to be called and used from R, my R code is going to get an interface this year.
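As a starting point for myself, a minimal CLI needs nothing beyond base R’s commandArgs(); the script name and behavior here are made up:

```r
#!/usr/bin/env Rscript
# Minimal command line sketch using only base R.
# Everything about `greet.R` is hypothetical.

main <- function(args = commandArgs(trailingOnly = TRUE)) {
  if (length(args) < 1) {
    cat("usage: greet.R <name>\n")
    return(invisible(1L))
  }
  cat("Hello,", args[[1]], "\n")
  invisible(0L)
}

# Run when invoked as `Rscript greet.R Ada`:
# main()
```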
Like every year, I’d like to keep up with this blog. I never do, but this year I had a lot of encouraging signs. I actually got considerable attention for every R-related post (high hundreds of views), so I think it’s time to lean into that. I’m hoping to write one R related post each week. I think the focus will help me have some chance of pulling this off. Since I also want to keep my R chops alive while I move further and further away from day to day programming responsibilities, it should be a two birds with one stone scenario. One major thing I haven’t decided– do I want to submit to r-bloggers? I’m sure it’d be a huge source of traffic, but I find it frustrating to have to click through from my RSS reader of choice when finding things there.
Lastly, I’d like to start to understand the internals of a core package I use every day. I haven’t decided what that’ll be. Maybe it’ll be something really fundamental like dplyr, DBI, or ggplot2. Maybe it’ll be something “simpler”. But I use a lot more R code than I read. And one thing I’ve learned every time I’ve forced myself to dig in is that I understand more R than I thought, and also that reading code is one of the best ways to learn more. I want to do at least one deep study that advances my sense of self-R-worth. Maybe I’ll even have to take the time to learn a little C++ and understand how Rcpp is being used to change the R world.
Special Thanks
The #rstats world on Twitter has been the only reason I can get on that service anymore. It’s a great and positive place where I learn a ton, and I really appreciate feeling like there is a family of nerds out there talking about stuff that I feel like no one should care about. My tweets are mostly stupid musings that come to me and retweets of enraging political stuff in the dumpster fire that is Trump’s America, so I’m always surprised and appreciative that anyone follows me. It’s so refreshing to get away from that and just read #rstats. So thank you for inspiring me and teaching me and being a fun place to be.
- I let out quite the “fuck yea!” when I got that upsert query with two common table expressions and two joins (one of them lateral) to work. ↩︎
#domesticgoddess
Hard to complain about the traveling I do when I get to see this outside of the window. Off to Miami.
New favorite tree candidate?
Feels like fall, but it looks like summer.
Killer title at front desk of a new @allovuebalance district
Pork tenderloin is inexpensive, low fat, and high protein. Great for watching weight. The lack of fat can make it tricky to cook, but sous vide at 130°F makes it easy to make amazing, tender, and juicy pork a week’s worth at a time.
My dogs have an Instagram now. #furbaby @gracie.and.brandy
Same
Still my favorite spot, especially when I’m feeling down and out.
I have not yet spent the time to figure out how to generate a JSON feed in Hugo. But I have built an R package to play with JSON feeds. It’s called jsonfeedr, and it’s silly simple.
Maybe I’ll extend this in the future. I hope people will submit PRs to expand it. For now, I was inspired by all the talk about why JSON feed even exists. Working with JSON is fun and easy. Working with XML is not.
Anyway, I figured the guy who registered json.blog should have a package out there working with JSON.
Finishes for the new house. Colors are a bit off in photos. In my defense, this is after like 6hrs in the home gallery.
Non-standard evaluation is one of R’s best features, and also one of its most perplexing. Recently I have been making good use of wrapr::let to allow me to write reusable functions without a lot of assumptions about my data. For example, let’s say I always want to group_by schools when adding up dollars spent, but that sometimes my data calls what is conceptually a school schools, school, location, cost_center, Loc.Name, etc. What I have been doing is storing a set of parameters in a list that maps the actual names in my data to consistent names I want to use in my code. Sometimes that comes from using params in an Rmd file. So the top of my file may say something like:
params:
school: "locations"
amount: "dollars"
enrollment: n
In my code, I may want to write a chain like
create_per_pupil <- . %>%
group_by(school) %>%
summarize(per_pupil = sum(amount) / n)
pp <- district_data %>%
create_per_pupil
Only my problem is that school isn’t always school. In this toy case, you could use group_by_(params$school), but it’s pretty easy to run into limitations with the _ functions in dplyr when writing functions.
Using wrapr::let, I can easily use the code above:
let(alias = params, {
create_per_pupil <- . %>%
group_by(school) %>%
summarize(per_pupil = sum(amount)/n)
})
pp <- district_data %>%
create_per_pupil
The core of wrapr::let is really scary.
body <- strexpr
for (ni in names(alias)) {
  value <- as.character(alias[[ni]])
  if (ni != value) {
    pattern <- paste0("\\b", ni, "\\b")
    body <- gsub(pattern, value, body)
  }
}
parse(text = body)
Basically, let is holding onto the code block contained within it, iterating over the list of key-value pairs that are provided, and then running a gsub on word boundaries to replace all instances of the list names with their values. Yikes.
This works, I use it all over, but I have never felt confident about it.
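To convince myself, here’s that same substitution step run by hand on a tiny example (the alias list mirrors the params above):

```r
# Replicating wrapr::let's word-boundary substitution in isolation.
alias <- list(school = "locations", amount = "dollars")
body  <- "group_by(school) %>% summarize(per_pupil = sum(amount) / n)"

for (ni in names(alias)) {
  value <- as.character(alias[[ni]])
  if (ni != value) {
    body <- gsub(paste0("\\b", ni, "\\b"), value, body)
  }
}
body
# "group_by(locations) %>% summarize(per_pupil = sum(dollars) / n)"
```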
The New World of tidyeval
The release of dplyr 0.6 along with tidyeval brings with it a ton of features to make programming over dplyr functions far better supported. I am going to read this page by Hadley Wickham at least 100 times. There are all kinds of new goodies (!!! looks amazing).
So how would I re-write the chain above sans let?
create_per_pupil <- . %>%
group_by(!!sym(school)) %>%
summarize(per_pupil = sum(amount)/n)
If I understand tidyeval, then this is what’s going on:

- sym evaluates school and makes the result a symbol,
- and !! says, roughly, “evaluate that symbol now”.

This way, with params$school having the value "school_name", sym(school) evaluates school to "school_name" and then makes it an unquoted symbol, school_name. Then !! tells R “You can evaluate this next thing in place as it is.”
I originally wrote this post trying to understand enquo, but I never got it to work right and it makes no sense to me yet. What’s great is that rlang::sym and rlang::syms with !! and !!! respectively work really well so far. There is definitely less flexibility– with the full-on quosure stuff you can have very complex evaluations. But I’m mostly worried about having very generic names for my data, so sym and syms seem to work great.
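For a self-contained version of the pattern, here’s the same idea against mtcars, with grp standing in for params$school (this is my sketch, not the district code above):

```r
# Group by a column whose name is only known as a string.
library(dplyr)
library(rlang)

grp <- "cyl"

counts <- mtcars %>%
  group_by(!!sym(grp)) %>%
  summarize(n = n())
```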
I think this is my new favorite tree.
We’re not even done brushing her. @westminsterfox
Maxwell House purchased literally three decades of advertising with these free Haggadahs.