This week’s On Education column is about the usefulness of some measures in ascertaining teacher quality. The article tells the story of an obviously dedicated teacher, Stacey Isaacson, who is ranked “7th percentile among her teaching peers — meaning 93 per cent are better” despite having “65 of 66 scored proficient on the state language arts test.” The article reflects some healthy frustration about testing, test scores, and what exactly value-added means, but overall I found the article very annoying. Let me take a segment of the article:
Everyone who teaches math or English has received a teacher data report. On the surface the report seems straightforward. Ms. Isaacson’s students had a prior proficiency score of 3.57. Her students were predicted to get a 3.69 — based on the scores of comparable students around the city. Her students actually scored 3.63. So Ms. Isaacson’s value added is 3.63-3.69.
What you would think this means is that Ms. Isaacson’s students averaged 3.57 on the test the year before; they were predicted to average 3.69 this year; they actually averaged 3.63, giving her a value added of 0.06 below zero.
Wrong.
These are not averages. For example, the department defines Ms. Isaacson’s 3.57 prior proficiency as “the average prior year proficiency rating of the students who contribute to a teacher’s value added score.”
Right.
The calculation for Ms. Isaacson’s 3.69 predicted score is even more daunting. It is based on 32 variables — including whether a student was “retained in grade before pretest year” and whether a student is “new to city in pretest or post-test year.”
Those 32 variables are plugged into a statistical model that looks like one of those equations that in “Good Will Hunting” only Matt Damon was capable of solving.
The process appears transparent, but it is clear as mud, even for smart lay people like teachers, principals and — I hesitate to say this — journalists.
The last line is the entire problem with this section. It is his job to tell us what he’s talking about in a way we can understand. That is at the core of journalism. Instead he just seems lazy.
For example: “These are not averages. For example, the department defines Ms. Isaacson’s 3.57 prior proficiency as ‘the average prior year proficiency rating of the students who contribute to a teacher’s value added score.’” It is an average. I don’t know how you can say something is not an average and then quote its definition as beginning with the phrase “the average.” Now that whole sentence from DOE is slightly confusing, but I think all it is saying is that 3.57 is the average prior proficiency of all those students who are in the value added model for Isaacson’s effectiveness (important to note here: it’s not her prior proficiency but that of her students). There are all sorts of reasons that particular students she had in her classroom might not be in the model. For instance, some students are likely excluded because DOE doesn’t have values for all 32 variables or because the student has a documented learning disability.
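To make that reading concrete, here is a minimal sketch in Python. The predicted and actual scores are the figures quoted in the column. The roster and its included flags are entirely hypothetical, made up only to show how an average over "students who contribute" can differ from an average over everyone in the room. This is my reading of the definition, not DOE's actual procedure.

```python
from statistics import mean

# Figures quoted in the column for Ms. Isaacson's students.
PREDICTED = 3.69
ACTUAL = 3.63

def value_added(actual, predicted):
    """Value added as the column describes it: actual minus predicted."""
    return actual - predicted

# HYPOTHETICAL roster: (prior-year proficiency, included in the model?).
# These numbers are invented for illustration, not taken from any report.
roster = [
    (3.4, True),
    (3.8, True),
    (3.5, True),
    (2.9, False),  # e.g. dropped because of missing data
]

def average_prior_proficiency(students):
    """Average prior proficiency over only the students who contribute."""
    return mean(score for score, included in students if included)

isaacson_value_added = round(value_added(ACTUAL, PREDICTED), 2)
prior_included = round(average_prior_proficiency(roster), 2)
prior_everyone = round(mean(score for score, _ in roster), 2)

print("value added:", isaacson_value_added)
print("average prior proficiency (included students):", prior_included)
print("average prior proficiency (whole classroom):", prior_everyone)
```

Either way it is still an average. The only wrinkle is which students go into it.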
Value added models are statistically complex – they usually have fixed effects, random effects, shrunken estimates, and much more. They are beyond the statistical expertise of most teachers, administrators, and – yes – journalists. But that isn’t the point. It doesn’t matter that it is a complex model. If that mattered we would be outraged at the fraud detection on our credit cards. No, the real criticism is that value added models are notoriously controversial and, in practice, not very accurate. That’s the attack that should be leveled. As RAND researchers noted in an often-cited review of value added models in education from 2004:
The research base is currently insufficient to support the use of [Value Added Methodology] VAM for high-stakes decisions. We have identified numerous possible sources of error in teacher effects and any attempt to use VAM estimates for high-stakes decisions must be informed by an understanding of these potential errors. However, it is not clear that VAM estimates would be more harmful than the alternative methods currently being used for test-based accountability. At present, it is most important for policymakers, practitioners, and VAM researchers to work together, so that research is informed by the practical needs and constraints facing users of VAM and implementation of the models is informed by an understanding of what inferences and decisions the research currently supports.
As it stands, my first reading was that a multivariate regression with 32 variables is something on the order of “Good Will Hunting,” which is a little extreme. He should note that it’s not the model’s complexity but its usefulness that is at stake. Either way, instead of making a useful point – that DOE is using a limited model in an overly serious fashion, or that value added models are contentious in the academic education literature – he attacks complexity for complexity’s sake.
The article does note the ridiculously large margin of error on these teacher reports: “Moreover, as the city indicates on the data reports, there is a large margin of error. So Ms. Isaacson’s 7th percentile could actually be as low as zero or as high as the 52nd percentile — a score that could have earned her tenure.” If your margin of error is that big, I would agree that you shouldn’t be making tenure decisions based on it.
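For a sense of how wide that is, here is a quick back-of-the-envelope check in Python using only the percentile figures quoted above. The one assumption is mine: I use the 50th percentile as a stand-in for a "typical" teacher, which is not a threshold the article or DOE states.

```python
# Percentile figures quoted in the column.
REPORTED = 7
LOW, HIGH = 0, 52

# ASSUMPTION: the 50th percentile stands in for a "typical" teacher.
MEDIAN = 50

interval_width = HIGH - LOW
includes_median = LOW <= MEDIAN <= HIGH

print(f"reported percentile: {REPORTED}")
print(f"plausible range: {LOW} to {HIGH} ({interval_width} points wide)")
print("range includes the median teacher:", includes_median)
```

A range that covers more than half of all possible rankings, and that includes the median, cannot really distinguish a bottom-decile teacher from an ordinary one.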
More generally, I think he misses the point that Isaacson could be a bad teacher despite the glowing anecdotes. She could work really hard but teach very smart students and not teach them all that much year-over-year. We don’t know, that’s true, because the DOE is using a model with a lot of kinks and a crazy margin of error, but that is different from knowing she’s a good teacher, which is what the article suggests.