November 12, 2014

That’s @edure killing it with #blueapron

November 6, 2014

Just some of the @allovuebalance extended family tonight at #1106party

November 1, 2014

November marks the start of National Novel Writing Month (NaNoWriMo). The quick version: folks band together and support each other to write 50,000 words over the course of November.

I would love to write a novel one day. I am not sure I could do it well, but I am pretty sure I could hit 50,000-80,000 words if I dedicated time to tell a story.

I don’t have a story to tell.

So this year, I have decided not to feel guilty about skipping out on another NaNoWriMo (always the reader, never the author), and instead I am modifying it to meet my needs. With no story to tell and no experience taking on a single project the size of a novel, I am going to tackle a smaller problem: this blog.

Instead of 50,000 words in 30 days, I am going to try to write 1,000 words a day for the next four weeks. I will not hold myself to a topic. I will not even hold myself to non-fiction. I will not hold myself to a number of posts or the size of the posts I write. I will not even hold myself to a true daily count, instead reviewing where I stand at the end of each week.

I am hoping that the practice of simply writing will grease the wheels and start the avalanche that leads to writing more. A small confession: I write two or three blog posts every week that never leave my drafts. I find myself unable to hit publish because the ideas tend to be far larger or far smaller than I anticipate when I set out to write and share my frustrations. I also get nervous, particularly when writing about things I do professionally, about not writing the perfect post, one that is clear, heavily researched, and expresses my views definitively and completely. This month, I say goodbye to that anxiety and start simply hitting publish.

I will leave you with several warnings.

  1. Things might get topically wacky. I might suddenly become a food blogger, or write about more personal issues, or write a short story and then whiplash to talking about programming, education policy, or the upcoming election. If high-volume, random topics aren’t your thing, you should probably unsubscribe from my RSS feed and check back in a month.
  2. I might write terrible arguments that are poorly supported and don’t reflect my views. This month, I will not accept my most common excuses for not publishing, which boil down to fear that people will hold me to the views I express in my first-draft thinking. I am going to make mistakes this month in public and print the dialog I am having with myself. The voices I allow room to speak as I struggle with values, beliefs, and opinions may shock and offend. This month, this blog is my internal dialog. Please read it as a struggle, however definitive the tone.
  3. I am often disappointed that the only things I publish are smaller ideas written hastily with poor editing. Again, this month I embrace the reality that almost everything I write that ends up published is the result of 20 minutes of furious typing with no looking back, rather than the work of a strong writer with a strong viewpoint and strong support.

I hope that by the end of this month I will have written at least a couple of pieces I feel proud of, and hopefully I will have a little less fear of hitting publish in the future.

August 14, 2014
July 26, 2014

Don’t wake up early in the morning to pee unless you’re ok with losing your spot.

Got these in gold. Wearing non-stop. First pair of sunglasses since I was 9 and went prescription only.

May 26, 2014

This is what happens when you attend a graduation and forget sunblock.

May 21, 2014
May 19, 2014

I have never found dictionaries or even a thesaurus particularly useful as part of the writing process. I like to blame this on my lack of careful, creative writing.

But just maybe, I have simply been using the wrong dictionaries. It is hard not to be seduced by the seeming superiority of Webster’s original style. A dictionary that is one part explanatory and one part exploratory provides a much richer experience of English as an enabler of ideas that transcend a meager vocabulary.

May 14, 2014
May 12, 2014

I had never thought of a use for Brett Terpstra’s Marky the Markdownifier before listening to today’s Systematic. Why would I want to turn a webpage into Markdown?

When I heard that Marky has an API, I was inspired. Pinboard has a “description” field that allows up to 65,000 characters. I never know what to put in this box. Wouldn’t it be great to put the full content of the page in Markdown into this field?

I set out to write a quick Python script to:

  1. Grab recent Pinboard links.
  2. Check to see if the URLs still resolve.
  3. Send the link to Marky and collect a Markdown version of the content.
  4. Post an updated link to Pinboard with the Markdown in the description field.

If all went well, I would release this script on GitHub as Pindown, a great way to put Markdown page content into your Pinboard links.

The script below is far from well-constructed. Had it worked out, I would have spent more time cleaning it up with things like better error handling and a more complete CLI to give more granular control over which links receive Markdown content.

Unfortunately, I found that Pinboard consistently returns a 414 error code because the URLs are too long. Why is this a problem? Pinboard, in an attempt to maintain compatibility with the del.icio.us API, uses only GET requests, whereas this kind of request would typically use a POST endpoint. As a result, I cannot send the page content along as a request body; everything has to be crammed into the URL.

So I’m sharing this just for folks who are interested in playing with Python, RESTful APIs, and Pinboard. I’m also posting it for my own posterity, since a non-del.icio.us-compatible version 2 of the Pinboard API is coming.

import requests
import json
import yaml


def getDataSet(call):
  # Fetch recent Pinboard posts as JSON
  r = requests.get('https://api.pinboard.in/v1/posts/recent' + call)
  data_set = json.loads(r.text)
  return data_set

def checkURL(url=""):
  # Follow redirects and confirm the URL still resolves
  newurl = requests.get(url)
  if newurl.status_code == 200:
    return newurl.url
  else:
    raise ValueError('URL did not resolve', newurl.status_code)

def markyCall(url=""):
  # Ask Marky the Markdownifier for a Markdown version of the page
  r = requests.get('http://heckyesmarkdown.com/go/?u=' + url)
  return r.text

def process_site(call):
  data_set = getDataSet(call)
  processed_site = []
  errors = []
  for site in data_set['posts']:
    try:
      url = checkURL(site['href'])
    except ValueError:
      errors.append(site['href'])
      continue
    description = markyCall(url)
    site['extended'] = description
    processed_site.append(site)
  print(errors)
  return processed_site

def write_pinboard(site, auth_token):
  stem = 'https://api.pinboard.in/v1/posts/add?format=json&auth_token='
  payload = {}
  payload['url'] = site.get('href')
  payload['description'] = site.get('description', '')
  payload['extended'] = site.get('extended', '')
  payload['tags'] = site.get('tags', '')
  payload['shared'] = site.get('shared', 'no')
  payload['toread'] = site.get('toread', 'no')
  r = requests.get(stem + auth_token, params=payload)
  print(site['href'] + '\t\t' + str(r.status_code))

def main():
  with open('AUTH.yaml') as settings:
    identity = yaml.safe_load(settings)
  auth_token = identity['user_name'] + ':' + identity['token']
  valid_sites = process_site('?format=json&auth_token=' + auth_token)
  for site in valid_sites:
    write_pinboard(site, auth_token)

if __name__ == '__main__':
  main()
April 27, 2014
April 12, 2014
April 1, 2014

I frequently work with private data. Sometimes, it lives on my personal machine rather than on a database server. Sometimes, even if it lives on a remote database server, it is better that I use locally cached data than query the database each time I want to do analysis on the data set. I have always dealt with this by creating encrypted disk images with secure passwords (stored in 1Password). This is a nice extra layer of protection for private data served on a laptop, and it adds little complication to my workflow. I just have to remember to mount and unmount the disk images.

However, it can be inconvenient from a project perspective to refer to data in a distant location like /Volumes/ClientData/Entity/facttable.csv. In most cases, I would prefer the data “reside” in data/ or cache/ “inside” of my project directory.

Luckily, there is a great way that allows me to point to data/facttable.csv in my R code without actually having facttable.csv reside there: symlinking.

A symlink is a symbolic link file that sits in the preferred location and references the file path to the actual file. This way, when I refer to data/facttable.csv the file system knows to direct all of that activity to the actual file in /Volumes/ClientData/Entity/facttable.csv.

From the command line, a symlink can be generated with a simple command:

ln -s target_path link_path

R offers a function that does the same thing:

file.symlink(target_path, link_path)

where target_path and link_path are both strings surrounded by quotation marks.
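
For example, using the paths from above, the one-time setup in R would look something like this (the paths are purely illustrative):

# Point data/facttable.csv at the real file on the mounted, encrypted disk image
file.symlink("/Volumes/ClientData/Entity/facttable.csv", "data/facttable.csv")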

One of the first things I do when setting up a new analysis is add common data storage file extensions like .csv and .xls to my .gitignore file so that I do not mistakenly put any data in a remote repository. The second thing I do is set up symlinks to the mount location of the encrypted data.

March 10, 2014
March 9, 2014

Education data often come in annual snapshots. Each year, students are able to identify anew, and while student identification numbers may stay the same, names, race, and gender can often change. Sometimes, even data that probably should not change, like a date of birth, is altered at some point. While I could spend all day talking about data collection processes and the automated validation that should help maintain clean data, the reality is that most researchers face multiple recorded characteristics per student and are unsure which one is accurate.

While it is true that identity is fluid, and sex/gender or race identifications are not inherently stable over time, it is often necessary to “choose” a single value for each student when presenting data. The Strategic Data Project does a great job of defining the business rules for these cases in its diagnostic toolkits.

If more than one [attribute value is] observed, report the modal [attribute value]. If multiple modes are observed, report the most recent [attribute value] recorded.

This is their rule for all attributes considered time-invariant for analysis purposes. I think it is a pretty good one.

Implementing this rule in R turned out to be more complex than it appeared, especially in performant code. In fact, it was this business rule that led me to learn how to use the data.table package.

First, I developed a small test set of data to help me make sure my code accurately reflected the expected results based on the business rule:

# Generate test data for modal_person_attribute().
modal_test <- data.frame(sasid = c('1000', '1001', '1000', '1000', '1005', 
                                   '1005', rep('1006', 4)),
                         race = c('Black', 'White', 'Black', 'Hispanic',
                                  'White', 'White', rep('Black', 2), 
                                  rep('Hispanic', 2)),
                         schoolyear = c(2006, 2006, 2007, 2008,
                                        2010, 2011, 2007, 2008,
                                        2010, 2011))

The test data generated by that code looks like this:

sasid race schoolyear
1000 Black 2006
1001 White 2006
1000 Black 2007
1000 Hispanic 2008
1005 White 2010
1005 White 2011
1006 Black 2007
1006 Black 2008
1006 Hispanic 2010
1006 Hispanic 2011

And the results should be:

sasid race
1000 Black
1001 White
1005 White
1006 Hispanic

My first attempts at solving this problem using data.table resulted in a pretty complex set of code.

# Calculate the modal attribute using data.table
modal_person_attribute_dt <- function(df, attribute){
  # df: rbind of all person tables from all years
  # attribute: vector name to calculate the modal value
  # Calculate the number of instances an attribute is associated with an id
  dt <- data.table(df, key='sasid')
  mode <- dt[, rle(as.character(.SD[[attribute]])), by=sasid]
  setnames(mode, c('sasid', 'counts', as.character(attribute)))
  setkeyv(mode, c('sasid', 'counts'))
  # Only include attributes with the maximum values. This is equivalent to the
  # mode with two records when there is a tie.
  mode <- mode[,subset(.SD, counts==max(counts)), by=sasid]
  mode[,counts:=NULL]
  setnames(mode, c('sasid', attribute))
  setkeyv(mode, c('sasid',attribute))
  # Produce the maximum year value associated with each ID-attribute 
  # pairing    
  setkeyv(dt, c('sasid',attribute))
  mode <- dt[,list(schoolyear=max(schoolyear)), by=c("sasid", attribute)][mode]
  setkeyv(mode, c('sasid', 'schoolyear'))
  # Select the last observation for each ID, which is equivalent to the highest
  # schoolyear value associated with the most frequent attribute.
  result <- mode[,lapply(.SD, tail, 1), by=sasid]
  # Remove the schoolyear to clean up the result
  result <- result[,schoolyear:=NULL]
  return(as.data.frame(result))
}

This approach seemed “natural” in data.table, although it took me a while to refine and debug since it was my first time using the package 1. Essentially, I use rle, a nifty function I have used in the past for my Net-Stacked Likert code, to count the number of instances of an attribute each student had in their record. I then subset the data to only the max count value for each student and merge these values back into the original data set. Then I order the data by student ID and year in order to select only the last observation per student.

I get a quick, accurate answer when I run the test data through this function. Unfortunately, when I ran the same code on approximately 57,000 unique student IDs and 211,000 total records, the results were less inspiring. My MacBook Air’s fans spun up to full speed, and the timings were terrible:

> system.time(modal_person_attribute_dt(all_years, 'sex'))
 user  system elapsed 
 40.452   0.246  41.346 

Data cleaning tasks like this one are often only run a few times. Once I have the attributes I need for my analysis, I can save them to a new table in a database, CSV, or similar and never run it again. But ideally, I would like to be able to build a document presenting my data completely and accurately from the raw delivered data, including all cleaning steps. So while I may use a cached, clean data set for some of the more sophisticated analysis while I am building up a report, in the final stages I begin running the entire analysis process, including data cleaning, each time I produce the report.

With the release of dplyr, I wanted to reexamine this particular function because it is one of the slowest steps in my analysis. I thought that with fresh eyes and a new way of expressing R code, I might be able to improve on the original function. Even if its performance ended up being fairly similar, I hoped the dplyr code would be easier to maintain, since I frequently use dplyr and only turn to data.table in specific, sticky situations where performance matters.

In about a tenth the time it took to develop the original code, I came up with this new function:

modal_person_attribute_dplyr <- function(x, sid, attribute, year){
  grouping <- lapply(list(sid, attribute), as.symbol)
  original <- x
  max_attributes <- x %.% 
                    regroup(grouping) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count))
  recent_max <- left_join(original, max_attributes) %.%
                regroup(list(grouping[[1]])) %.%
                filter(!is.na(count) & count == max(count))
  results <- recent_max %.% 
             regroup(list(grouping[[1]])) %.%
             filter(year == max(year))
  return(results[,c(sid, attribute)])
}

At least to my eyes, this code is far more expressive and elegant. First, I generate a data.frame with only the rows that have the most common attribute per student by grouping on student and attribute, counting the size of those groups, and filtering to the most common group per student. Then, I join back to the original data, remove any records without a count from the previous step, and keep the rows with the maximum count per student ID. This recovers the year value for each of the students so that in the next step I can just choose the rows with the highest year.

There are a few funky things (note the use of regroup and grouping, which are related to dplyr’s poor handling of strings as arguments), but for the most part I have shorter, clearer code that closely resembles the plain-English stated business rule.

But was this code more performant? Imagine my glee when this happened:

> system.time(modal_person_attribute_dplyr(all_years, sid='sasid', 
> attribute='sex', year='schoolyear'))
Joining by: c("sasid", "sex")
   user  system elapsed 
  1.657   0.087   1.852 

That is a remarkable increase in performance!

Now, I realize that I may have cheated. My data.table code isn’t very good and could probably follow a pattern closer to what I did in dplyr; the results might be much closer in the hands of a more adept developer. But the take-home message for me was that dplyr enabled me to write more performant code naturally because of its expressiveness. Not only is my code faster and easier to understand, it is also simpler and took far less time to write.
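
For what it is worth, here is a rough, untested sketch of what that dplyr-like pattern might look like in data.table, using current data.table syntax and the column names from the test data above; treat it as an illustration rather than a drop-in replacement:

library(data.table)

modal_person_attribute_dt2 <- function(df){
  dt <- as.data.table(df)
  # Count each sasid-race pair and keep only the modal value(s) per student
  counts <- dt[, .(count = .N), by = .(sasid, race)]
  modal  <- counts[, .SD[count == max(count)], by = sasid]
  # Among tied modal values, keep the one observed most recently
  recent <- dt[modal, on = c("sasid", "race")]
  recent <- recent[, .SD[schoolyear == max(schoolyear)], by = sasid]
  unique(recent[, .(sasid, race)])
}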

It is not every day that a tool provides powerful expressiveness and yields greater performance.

Update

I have made some improvements to this function to simplify things. I will be maintaining this code in my PPSDCollegeReadiness repository.

modal_person_attribute <- function(x, sid, attribute, year){
  # Select only the important columns
  x <- x[,c(sid, attribute, year)]
  names(x) <- c('sid', 'attribute', 'year')
  # Clean up years
  if(TRUE %in% grepl('_', x$year)){
    x$year <- gsub(pattern='[0-9]{4}_([0-9]{4})', '\\1', x$year)
  }  
  # Calculate the count for each person-attribute combo and select max
  max_attributes <- x %.% 
                    group_by(sid, attribute) %.%
                    summarize(count = n()) %.%
                    filter(count == max(count)) %.%
                    select(sid, attribute)
  # Find the max year for each person-attribute combo
  results <- max_attributes %.% 
             left_join(x) %.%
             group_by(sid) %.%
             filter(year == max(year)) %.%
             select(sid, attribute)
  names(results) <- c(sid, attribute)
  return(results)
}
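
As a sanity check, pointing the updated function at the test data from the top of this post (with the dplyr of that era, which used the %.% operator) should reproduce the expected results listed above:

modal_person_attribute(modal_test, sid = 'sasid', attribute = 'race',
                       year = 'schoolyear')
# Expected: 1000 Black, 1001 White, 1005 White, 1006 Hispanic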

  1. It was over a year ago that I first wrote this code. ↩︎

February 26, 2014

We burden Latinos (and other traditionally underserved communities) with expensive housing because of the widespread practice of using homestead exemptions in Rhode Island. By lowering the real estate tax rate, typically by 50%, for owner-occupied housing, we dramatically inflate the tax rate paid by Rhode Islanders who are renting.

Echoing a newly filed lawsuit in New York City over discriminatory real estate tax regimes, this new report emphasizes the racist incentives built into our property tax.

Homestead exemptions are built on the belief that renters are non-permanent residents of communities, care less for the properties they occupy and the neighborhoods they live in, and are worse additions than homeowners. Frankly, it is an anti-White-flight measure meant to assure people that only those with the means to purchase and the intent to stay will join their neighborhoods. Wealthy, largely White property owners see homestead exemptions as fighting an influx of “slum lords”, which is basically their perception of anyone who purchases a home or builds apartments and rents them out.

Rather than encouraging denser communities with higher land utilization and more housing to reduce the cost of living in dignity, we subsidize low-value (per acre) construction that maintains inflated housing costs.

Full disclosure: I own a condo in Providence and receive a 50% discount on my taxes. In fact, living in a condo Downcity, my home value is depressed because of the limited ways I can use it. I could rent out my current condo at market rate and lose money because of the doubling in taxes I would endure, versus turning a small monthly profit at the same rent if I simply paid somewhat higher taxes across the board. The flexibility to use my property as my own residence or as a rental unit would more than pay for the higher taxes.

So while I do have personal reasons to support removing the homestead exemption, even if I lived in a single-family home on the East Side that was not attractive as a rental property, I would still think this situation is absurd. Homeowners' taxes should easily be 20% higher so that renters can be taxed 30% less. Maybe some of our hulking, vacant infrastructure could be more viably converted into housing stock, lowering the cost for all residents. Maybe we could even see denser development, because there would actually be a market for renters at the monthly rates that would need to be charged to recoup expenses. At least the rent wouldn’t be so damn high for too many people of color and people living in or near poverty.

February 17, 2014

Hadley Wickham has once again 1 made R ridiculously better. Not only is dplyr incredibly fast, but the new syntax allows some really complex operations to be expressed in a remarkably beautiful way.

Consider a data set, course, with a student identifier, sid, a course identifier, courseno, a quarter, quarter, and a grade on a scale of 0 to 4, gpa. What if I wanted to know the number of courses a student has failed over the entire year, as defined by having an overall grade of 1.0 or lower?

In dplyr:

course %.% 
group_by(sid, courseno) %.%
summarise(gpa = mean(gpa)) %.%
filter(gpa <= 1.0) %.%
summarise(fails = n())

I refuse to even sully this post with the way I would have solved this problem in the past.
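
For concreteness, here is a minimal, made-up course table and the same pipeline written with the %>% pipe that current dplyr uses in place of %.%; the column names match those described above:

library(dplyr)

# Toy data (invented values): grades by student, course, and quarter
course <- data.frame(
  sid      = c(1, 1, 1, 1, 2, 2),
  courseno = c('MATH1', 'MATH1', 'ENG1', 'ENG1', 'MATH1', 'MATH1'),
  quarter  = c(1, 2, 1, 2, 1, 2),
  gpa      = c(0.5, 1.0, 3.0, 3.5, 2.0, 2.5)
)

course %>%
  group_by(sid, courseno) %>%
  summarise(gpa = mean(gpa)) %>%   # overall grade per student-course
  filter(gpa <= 1.0) %>%           # keep only failing overall grades
  summarise(fails = n())           # count failed courses per student
# Student 1 fails MATH1 (mean gpa 0.75); students with no failures drop out of the result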


  1. Seriously, how many of the packages he has managed/written are indispensable to using R today? It is no exaggeration to say that the world would have many more Stata, SPSS, and SAS users if not for Hadleyverse. ↩︎

February 9, 2014

These quotes are absolutely striking, in that they give a clear glimpse into the ideological commitments of the Republican Party. From Sen. Blunt and Rep. Cole, we get the revelation that— for conservatives— the only “work” worth acknowledging is wage labor. To myself, and many others, someone who retires early to volunteer— or leaves a job to care for their children— is still working, they’re just outside the formal labor market. And indeed, their labor is still valuable— it just isn’t compensated with cash.

One of the greatest benefits of wealth is that it can liberate people to pursue happiness. When we tie a basic need for living complete lives of dignity to full-time employment, people will find themselves willing to make many sacrifices to secure that need. In our nation of great wealth, with liberty and freedom as core values, it is hard to believe that the GOP would decry the liberating effect of ending health care's contingency on work.

There is no work rule, regulation, or union that empowers workers more in their relationship with their employers than taking the threat of losing health care off the table. An increasingly libertarian right should be celebrating this as a key victory, rather than celebrating the coercive role that health care currently plays in our lives.

Republicans aren’t as worried as the idle rich, who— I suppose— have earned the right to avoid a life of endless toil. Otherwise— if Republicans really wanted everyone to work as much as possible— they’d support confiscatory tax rates. After all, nothing will drive an investment banker back to the office like the threat of losing 70 percent of her income to Uncle Sam.

Oh yeah, I forgot. For all their claims to loving liberty and freedom, what the GOP really stands for is protecting liberty and freedom for the existing “deserving” wealthy. They will fight tooth and nail to remove estate taxes because inheritance is a legitimate source of liberty. Removing the fear of entering a hospital uninsured, after being unable to access preventive care, is apparently what deprives folks of “dignity”.

February 5, 2014

My Democracy Prep colleague Lindsay Malanga and I often say we should start an organization called the Coalition of Pretty Good Schools. We’d start with the following principles.

  1. Every child must have a safe, warm, disruption-free classroom as a non-negotiable, fundamental right.
  2. All children should be taught to read using phonics-based instruction.
  3. All children must master basic computational skills with automaticity before moving on to higher mathematics.
  4. Every child must be given a well-rounded education that includes science, civics, history, geography, music, the arts, and physical education.
  5. Accountability is an important safeguard of public funds, but must not drive or dominate a child’s education. Class time must not be used for standardized test preparation.

We have no end of people ready to tell you about their paradigmatic shift that will fix education overnight. There has been plenty of philosophizing about the goals, purpose, and means of education. Everyone is ready to pull out tropes about the “factory model” of education our system is built on.

The reality is that the education system too often fails at very basic delivery, period. I would love to see more folks draw a line in the sand at their minimum basic requirements, and not in an outrageous, political winky-wink where they are wrapping their ideal in the language of the minimum. Let's have a deep discussion right now about the minimum basic requirements and let's get relentless about making that happen without the distraction of the dream. Frankly, whatever your dream is, so long as it involves kids going somewhere to learn 1, if we can’t deliver on the basics it will be dead on arrival.


  1. Of course, for a group of folks who are engaged in Dreamschooling, we cannot take for granted that schools will be places or that children will be students in any traditional sense of the word. However, I suspect that if we have a frank conversation about the minimum expectations for education, this will not be a particularly widely held sentiment. If our technofuturism does complete its mindmeld with the anarcho-____ movements on the left and right and leads to a dramatically different conceptualization of childhood in the developed world in my lifetime… ↩︎

December 9, 2013

We find that public schools offered practically zero return education on the margin, yet they did enjoy significant political and financial support from local political elites, if they taught in the “right” language of instruction.

One thing that both progressives and libertarians agree upon is that the social goals of education are woefully underappreciated and under-considered in the current school reform discussion. Both school choice and local, democratic control of schools are reactions to centralization resulting in “elites… [selecting] the ‘right’ language of instruction.”

I am inclined to agree with neither.

December 3, 2013

Update

Turns out the original code below was pretty messed up. All kinds of little errors I didn’t catch. I’ve updated it below. There are a lot of options to refactor this further that I’m currently considering. Sometimes it is really hard to know just how flexible something this big really should be. I think I am going to wait until I start developing tests to see where I land. I have a feeling moving toward a more test-driven work flow is going to force me toward a different structure.

I recently updated the function I posted about back in June that calculates the difference between two dates in days, months, or years in R. It is still surprising to me that difftime can only return units from seconds up to weeks. I suspect this has to do with the challenge of properly defining a “month” or “year” as a unit of time, since these are variable.

While there was nothing wrong with the original function, it did irk me that it always returned an integer. In other words, the function returned only complete months or years. If the start date was 2012-12-13 and the end date was 2013-12-03, the function would return 0 years. Most of the time, this is the behavior I expect when calculating age. But it is completely reasonable to want to include partial years or months, e.g. returning 0.9724605 in the aforementioned example.

So after several failed attempts because of silly errors in my algorithm, here is the final code. It will be released as part of eeptools 0.3, which should be available on CRAN soon 1.

age_calc <- function(dob, enddate=Sys.Date(), units='months', precise=TRUE){
  if (!inherits(dob, "Date") | !inherits(enddate, "Date")){
    stop("Both dob and enddate must be Date class objects")
  }
  start <- as.POSIXlt(dob)
  end <- as.POSIXlt(enddate)
  if(precise){
    # POSIXlt stores the year as an offset from 1900, so convert back before the leap-year test
    start_is_leap <- ifelse((start$year + 1900) %% 400 == 0, TRUE, 
                        ifelse((start$year + 1900) %% 100 == 0, FALSE,
                               ifelse((start$year + 1900) %% 4 == 0, TRUE, FALSE)))
    end_is_leap <- ifelse((end$year + 1900) %% 400 == 0, TRUE, 
                        ifelse((end$year + 1900) %% 100 == 0, FALSE,
                               ifelse((end$year + 1900) %% 4 == 0, TRUE, FALSE)))
  }
  if(units=='days'){
    result <- difftime(end, start, units='days')
  }else if(units=='months'){
    months <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='months', SIMPLIFY=FALSE), 
                     length) - 1
    # length(seq(start, end, by='month')) - 1
    if(precise){
      # Check the leap-year case first so a February end month can return 29 days
      month_length_end <- ifelse(end$mon==1 & end_is_leap, 29,
                                 ifelse(end$mon==1, 28,
                                        ifelse(end$mon %in% c(3, 5, 8, 10), 
                                               30, 31)))
      month_length_prior <- ifelse((end$mon-1)==1 & start_is_leap, 29,
                                   ifelse((end$mon-1)==1, 28,
                                          ifelse((end$mon-1) %in% c(3, 5, 8, 10), 
                                                 30, 31)))
      month_frac <- ifelse(end$mday > start$mday,
                           (end$mday-start$mday)/month_length_end,
                           ifelse(end$mday < start$mday, 
                            (month_length_prior - start$mday) / 
                                month_length_prior + 
                                end$mday/month_length_end, 0.0))
      result <- months + month_frac
    }else{
      result <- months
    }
  }else if(units=='years'){
    years <- sapply(mapply(seq, as.POSIXct(start), as.POSIXct(end), 
                            by='years', SIMPLIFY=FALSE), 
                     length) - 1
    if(precise){
      start_length <- ifelse(start_is_leap, 366, 365)
      end_length <- ifelse(end_is_leap, 366, 365)
      year_frac <- ifelse(start$yday < end$yday,
                          (end$yday - start$yday)/end_length,
                          ifelse(start$yday > end$yday, 
                                 (start_length-start$yday) / start_length +
                                end$yday / end_length, 0.0))
      result <- years + year_frac
    }else{
      result <- years
    }
  }else{
    stop("Unrecognized units. Please choose years, months, or days.")
  }
  return(result)
}
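
To tie this back to the example above, the precise calculation for that date range now returns the fractional year instead of zero:

dob <- as.Date('2012-12-13')
enddate <- as.Date('2013-12-03')
age_calc(dob, enddate, units = 'years')
# approximately 0.97 (0.9724605), rather than the 0 whole years the old version returned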

  1. I should note that my mobility function will also be included in eeptools 0.3. I know I still owe a post on the actual code, but it is such a complex function I have been having a terrible time trying to write clearly about it. ↩︎

December 2, 2013

PISA Results

I wanted to call attention to these interesting PISA results. It turns out that student anxiety in the United States is lower than the OECD average and belief in ability is higher 1. I thought that all of the moves in education since the start of standards-based reform were supposed to be generating tremendous anxiety and failing to produce students with a high sense of self-efficacy?

It is also worth noting that students in the United States were more likely to skip out on school, and this had a higher-than-typical impact on student performance. One interpretation could be that students are less engaged; another is that schooling activities have a large impact on students, rather than schools mattering less than student inputs.

I have always had a hard time reconciling the calls for higher teacher pay and better working conditions, and the evidence that missing even 10% of schooling has a huge impact on student outcomes, with the belief that addressing other social inequities is the key way to achieve better outcomes for kids.

This is all an exercise in nonsense. It is incredibly difficult to transfer findings from surveys across dramatic cultural differences. It is also hard to imagine what can be learned about the delivery of education in the dramatically different contexts that exist. The whole international comparison game seems like one big Rorschach test where the price of admission is leaving any understanding of culture, context, and external validity at the door.

P.S.: The use of color in this visualization is awful. There is a sense that they are trying to be “value neutral” with data that is ordinal in nature (above, same, or below), and in doing so they chose two colors that are very difficult to distinguish. Yuck.


  1. The site describes prevalence of anxiety as the “proportion of students who feel helpless when faced with math problems” and belief in ability as the “proportion of students who feel confident in their math abilities”. Note that, based on these definitions, one might also think either that curricula were not so misaligned with international benchmarks or that we are already seeing the fruits of a partial transition to the Common Core. Not knowing the trend for these data, or some of the specifics about the collection instrument, makes that difficult to assess. ↩︎