New York City Eye: Research Shows: Value-Added Teacher Accountability Not Ready for Prime Time

Sunday, September 23, 2012

As an AlterNet article shows, research cited by supporters of the Chicago Public Schools' push for value-added modeling (VAM) actually deflates pro-testing arguments: "researchers were unable to compare the long term results of high value-added teachers with results of teachers who excelled in other ways that might, conceivably, have even larger impacts on long term outcomes."

Other research concluded that "holding teachers accountable for growth in the test scores (called “value-added”) of their students is more harmful than helpful to children’s educations."

What does help? [This point is not the focus of the AlterNet article.] A diverse curriculum and attention to children's basic health needs. These concerns are among the objectives of the Chicago Teachers Union strikers, and a fuller discussion can be read in their report, "The Schools Chicago's Children Deserve."

Writing in AlterNet, Richard Rothstein of the Economic Policy Institute raises another concern, one so far little addressed even among observers who are critical of linking testing to teacher evaluation: will principals, once they are informed of teachers' value-added scores, let that information bias their own observations of those teachers?
"But according to the Chicago district proposal, the observations will be conducted by principals, who will know the value-added scores of teachers they are observing. How principals will be influenced by this knowledge cannot be known—will they tend to give high ratings to teachers with high value-added scores in order not to call attention to possible flaws in their observational skills, will they tend to offset value-added conclusions in order to save favored teachers who have low value-added, or will they tend to sink unfavored teachers with high value-added?"

Rothstein wondered how widespread teacher discontent is. One specific kind of discontent, that over test-based teacher evaluations, is sure to grow as legislation mandating such evaluations spreads like wildfire across the United States. This legislation has been spurred on by President Barack Obama and Education Secretary Arne Duncan's signature reform, Race to the Top. As of October 2011, when the National Council on Teacher Quality published a study, 17 states and the District of Columbia were giving "student achievement" an "objective" role in assessing teacher performance: Arizona, Colorado, Delaware, D.C., Florida, Idaho, Illinois, Indiana, Louisiana, Maryland, Michigan, Minnesota, Nevada, New York, Ohio, Oklahoma, Rhode Island, and Tennessee.

Richard Rothstein debunks assumptions behind the drive for teacher accountability based on value-added modeling in "Is 'Teacher Accountability' Ready for Prime-Time?" from AlterNet, September 17, 2012.

Economic Policy Institute / By Richard Rothstein
Is 'Teacher Accountability' Ready for Prime-Time?
Though Rahm Emanuel wants to put student test scores at the center of teacher evals, there's little proof such measuring sticks make any sense.
September 17, 2012

It was bound to happen, whether in Chicago or elsewhere. What is surprising about the Chicago teachers’ strike is that something like this did not happen sooner.

The strike represents the first open rebellion of teachers nationwide over efforts to evaluate, punish and reward them based on their students’ scores on standardized tests of low-level basic skills in math and reading. Teachers’ discontent has been simmering now for a decade, but it took a well-organized union to give that discontent practical expression. For those who have doubts about why teachers need unions, the Chicago strike is an important lesson.

Nobody can say how widespread discontent might be. Reformers can certainly point to teachers who say that the pressure of standardized testing has been useful, has forced them to pay attention to students they previously ignored, and could rid their schools of lazy and incompetent teachers.

But I frequently get letters from teachers, and speak with teachers across the country who claim to have been successful educators and who are now demoralized by the transformation of teaching from a craft employing skill and empathy into routinized drill instruction using scripted curriculum. They are also demoralized by the weeks and weeks of the school year now devoted to gamesmanship—test preparation designed not to teach literacy or mathematics but only to make it seem that students can perform in an artificial setting better than they actually do.

I suspect, but cannot prove, that the latter group of teachers is more numerous and that teachers in the discontented group are more likely to be seasoned, experienced, and successful. I suspect that teachers in the group supportive of standardized testing are more likely to be young, frequently hired outside the usual teacher training stream, and conditioned to think of education as little more than test preparation.

The research evidence is weighty in support of the discontented view; two years ago, EPI assembled a group of prominent testing experts and education policy experts to assess the research evidence on the use of test scores to evaluate teachers. [Eva L. Baker, Paul E. Barton, Linda Darling-Hammond, Edward Haertel, Helen F. Ladd, Robert L. Linn, Diane Ravitch, Richard Rothstein, Richard J. Shavelson, and Lorrie A. Shepard, "Problems with the Use of Student Test Scores to Evaluate Teachers"] It concluded that holding teachers accountable for growth in the test scores (called “value-added”) of their students is more harmful than helpful to children’s educations. Placing serious consequences for teachers on the results of their students’ tests creates rational incentives for teachers and schools to narrow the curriculum to tested subjects, and to tested areas within those subjects. Students lose instruction in history, the sciences, the arts, music, and physical education, and teachers focus less on development of children’s non-cognitive behaviors—cooperative activities, character, social skills—that are among the most important aims of a solid education.

Recently, however, some have claimed, to the contrary, that there are great benefits to holding teachers accountable for standardized test scores. One study, sponsored by the Gates Foundation, administered a higher quality test of reasoning and critical thinking skills to students who had also taken their state’s high stakes standardized test of basic skills. The Gates researchers found that teachers whose students had high value-added scores on the standardized basic skills test also tended to have high value-added scores on the test of reasoning (i.e., teachers’ value-added scores on the two tests were positively correlated). This was a potentially important finding because it suggested that narrowing the curriculum as a consequence of high stakes testing is not something about which we should be concerned. If we know that teachers who are effective at teaching basic skills are also effective at developing reasoning skills, then we can hold teachers accountable only for basic skills and be confident that their students are getting both.

But although the teacher results were correlated, they were only weakly correlated. True, more teachers who had high value-added scores on a basic skills test also had high value-added scores on a test of reasoning, but it wasn’t many more. If you fired teachers who did poorly at teaching basic skills you would get rid of many teachers who did poorly at developing reasoning skills, but you would also get rid of many teachers who did well at developing reasoning skills. The first group (those who did poorly) would be larger than the second group (those who did well), but not much larger.
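The arithmetic behind this point can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the correlation of 0.2, the bottom-quartile dismissal rule, and the function name `simulate_overlap` are all hypothetical choices made here, not figures or methods taken from the Gates study.

```python
import math
import random
import statistics


def simulate_overlap(correlation, n_teachers=20000, seed=0):
    """Simulate two value-added scores per teacher with a chosen correlation.

    Returns (below, above): among teachers in the bottom quarter on the
    basic skills score, how many fall below and how many fall at or above
    the median on the reasoning score. All numbers are hypothetical.
    """
    rng = random.Random(seed)
    teachers = []
    for _ in range(n_teachers):
        basic = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # Standard construction of a second score with the target correlation.
        reasoning = correlation * basic + math.sqrt(1 - correlation ** 2) * noise
        teachers.append((basic, reasoning))

    median_reasoning = statistics.median(r for _, r in teachers)

    # "Fire" the bottom quarter of teachers ranked by basic skills value-added.
    teachers.sort(key=lambda t: t[0])
    dismissed = teachers[: n_teachers // 4]

    below = sum(1 for _, r in dismissed if r < median_reasoning)
    above = len(dismissed) - below
    return below, above


# Assumed weak correlation (0.2) versus an assumed very strong one (0.95).
weak_below, weak_above = simulate_overlap(0.2)
strong_below, strong_above = simulate_overlap(0.95)
print("weak correlation:   below median", weak_below, "| at or above", weak_above)
print("strong correlation: below median", strong_below, "| at or above", strong_above)
```

Under the weak correlation, the dismissed group that is also weak on reasoning is the larger one, but a substantial share of dismissed teachers sit at or above the median on reasoning. Only when the two scores are very strongly correlated does that second group nearly disappear.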

The second highly publicized study, done by a group of Harvard researchers, concluded that teachers whose students had high value-added test scores were also those whose students had better long term adult outcomes—better earnings, for example. This was a potentially important finding because it suggested that these tests had not become ends in themselves, but rather that success for students on these tests made the students more likely to be successful as adults, and if you put pressure on teachers to increase their students’ test scores you would also be putting pressure on these teachers to improve their students’ adult success. And that would be a good thing.

The flaw here is that the researchers were unable to compare the long term results of high value-added teachers with results of teachers who excelled in other ways that might, conceivably, have even larger impacts on long term outcomes. For example, the researchers could not say whether teachers who are more effective at developing their students’ cooperative behavior, or reasoning skills (and we know from the Gates study that only sometimes are these the same teachers who are more effective at teaching basic skills) might have students who have even better adult outcomes—like earnings. If this were the case (and we have no reason to believe it one way or the other), then getting teachers to shift their attention from teaching reasoning or cooperative behavior to standardized test preparation might be lowering their students’ future earnings, not raising them.

In short, the two recent studies most heavily promoted by supporters of the Chicago district’s plan to evaluate teachers in part by their students’ test scores do not confirm that the district’s position is wise. It may be, but it also may do great harm.

The Chicago district, and other promoters of teacher evaluation based in large part on student test scores, have become aware of these problems. And so they now emphasize that they support evaluating teachers by “multiple measures”: not only by their students’ test scores but also by students’ performance of assigned tasks under the supervision of experts, by observation of teachers by their principals, and sometimes (for high school students, for example) by student reports of teacher effectiveness.

This is a fine, balanced approach in theory, but it is very difficult to implement in practice. For example, when the Gates Foundation study also showed a correlation between a teacher’s value-added test scores and a rated observation by instructional experts, it conducted this experiment by providing the experts with videotapes of teachers conducting instruction. The experts watching (and evaluating) the videotapes did not know the value-added scores of the teachers on the tapes, so the two measures (value-added scores and expert observation) were independent. But according to the Chicago district proposal, the observations will be conducted by principals, who will know the value-added scores of teachers they are observing. How principals will be influenced by this knowledge cannot be known: will they tend to give high ratings to teachers with high value-added scores in order not to call attention to possible flaws in their observational skills, will they tend to offset value-added conclusions in order to save favored teachers who have low value-added, or will they tend to sink unfavored teachers with high value-added? One thing of which we can be certain: Armed with knowledge of teacher value-added scores, it will be much harder for principals to observe and evaluate teachers objectively. In times past, when student test scores did not have high stakes for schools or teachers, principals with knowledge of test results could use this knowledge constructively to guide their observations; principals would visit classrooms where test scores were poor to see if they could determine something being done poorly, or visit classrooms where test scores were good to see if they could learn what was being done right. With high stakes now attached to the test, such constructive evaluation is less likely.

With a rush to implement test-based accountability before these systems have been tested experimentally, or even thought through carefully, the Chicago district proposal is in some respects silly. What about teachers who don’t teach math or reading and so don’t have standardized value-added scores? Or those who have not had students for a full year, or who have not been teaching the same subject for sufficient time to have value-added scores? The district proposes to evaluate these teachers using their school-wide average value-added scores. Perhaps, with this proposal, the district is acknowledging that a teacher’s impact on a student is not only the result of her own efforts, but of the school’s entire teacher corps, working collaboratively. But if so, then individual teacher evaluation-by-test-score makes no sense (even if individual teacher data happen to be available), and student growth data should be used only to evaluate schools as a whole.

The impact of teachers’ practice on each other is apparent. Should, for example, a fifth grade teacher’s value-added score be adjusted if her students had come from a class the year before with a fourth grade teacher whose value-added score was unusually high or low? With similar students, a fifth grade teacher will have an easier (or perhaps harder) time if her students had a more effective teacher in the fourth grade. It could be easier if students had a more effective teacher the previous year, because the skills students learn in one year can give them an advantage in learning in subsequent years. Or a teacher’s job could be harder if students had a more effective teacher the previous year, because students who learn more in one year will have less room to grow in the next year. Nobody, neither educational theorist nor practical policy maker, has an answer to this problem, and the Chicago proposal ignores this obvious source of distortion.

News reports suggest that the Chicago strike may be settled soon. The district’s latest proposal is that value-added test score data could ultimately make up 25 percent of a teacher’s evaluation, with student growth on some yet-to-be-defined “performance” tasks—writing an essay, for example—comprising another 15 percent. This is far better than what some districts around the country are attempting to do, with standardized test score data making up half of a teacher’s evaluation. The union will likely agree to something close to the district proposal, with an appeals process that is stronger than what the district has thus far proposed.

Although the Chicago teacher evaluation system will not be the worst in the country, it will still rely on methods that are not yet ready for prime time. Whether the Chicago strike slows down the rush in other places to implement a terribly flawed system, or the settlement encourages other places to try it, remains to be seen.

Richard Rothstein is a research associate at the Economic Policy Institute and senior fellow at the Chief Justice Earl Warren Institute on Law and Social Policy at U.C. Berkeley.