Mindful of H.L. Mencken’s observation that “there is always an easy solution to every human problem—neat, plausible, and wrong,” let us urge the new Obama administration to avoid making the mistake of previous administrations in equating accountability in education with high-stakes test scores. There is increasing evidence that flaws in current test design should all but disqualify their continued use as metrics of accountability, especially in science and mathematics education.
To help us head off a potential collapse of trust in public education comparable in scale to the collapse of trust in our financial system, we might draw parallels with what we are learning from the economy. In particular, the closure of Bernard Madoff’s fraudulent investment firm stands to teach us at least four basic lessons we might use in reflecting on the role high-stakes testing plays in driving current education reform.
A first lesson is that the most compelling evidence for something’s being wrong is often hidden in plain view. Consistent investment returns of 10 percent or more can’t be real, and in Mr. Madoff’s case, they weren’t. Similarly in education, there is mounting evidence in plain view that our current approach to high-stakes-test design can’t tell us what we need to know in order to drive education reform.
Separate from the question of whether any one test can give a complete picture of what a student knows or what he or she has learned in a given year—the answer to which is obviously no—there is the more precise question of whether, empirically, the tests work as good measures of what a teacher has done during a given school year. The answer to that question is also no.
Using student scores from the Texas Assessment of Knowledge and Skills, or TAKS, our university-based research group has analyzed both the effectiveness of some specific reform projects in mathematics and year-to-year scores from the entire state in science, math, social studies, and English. For the most part, we have found the TAKS tests to be what W. James Popham of the University of California, Los Angeles, calls “insensitive to instruction.”
This means that even in situations where sensitivity to instruction is most implicated—those where there is a sustained, aggressive, high-quality, and content-focused intervention—most of a student’s score on the high-stakes TAKS (more than 70 percent of the variance) is predicted by the previous year’s math scores (with at most only 7 percent to 8 percent of the variance related to the intervention). We have checked with colleagues involved in mathematics interventions from around the country, and their results with similar tests are comparable.
We also found in our study that the predictive power of previous math scores holds up over a number of years of math testing, not just for the year prior. Even across subject areas, test scores predict other test scores in ways that are very likely to overwhelm the effects that any teacher could be expected to have in any given year.
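The variance figures above can be made concrete with a small simulation. This is an illustrative sketch, not the authors’ actual analysis: the data are synthetic, constructed so that roughly 70 percent of current-year score variance traces to the prior-year score and roughly 8 percent to a hypothetical intervention flag, matching the proportions reported in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000  # simulated students

# Standardized prior-year math scores and a hypothetical intervention flag.
prior = rng.normal(0.0, 1.0, n)
intervention = rng.integers(0, 2, n)

# Build current-year scores so that ~70% of variance traces to prior scores
# and ~8% to the intervention -- the proportions reported in the text.
noise = rng.normal(0.0, 1.0, n)
current = (np.sqrt(0.70) * prior
           + np.sqrt(0.08) * (2 * intervention - 1)  # +/-1, unit variance
           + np.sqrt(0.22) * noise)

# Variance explained by the prior-year score alone.
r2_prior = np.corrcoef(prior, current)[0, 1] ** 2

# Incremental variance explained once the intervention flag is added,
# via an ordinary least-squares fit on both predictors.
X = np.column_stack([np.ones(n), prior, intervention])
beta, *_ = np.linalg.lstsq(X, current, rcond=None)
resid = current - X @ beta
r2_both = 1 - resid.var() / current.var()
r2_intervention = r2_both - r2_prior

print(f"prior-year R^2: {r2_prior:.2f}")            # ~0.70
print(f"intervention adds: {r2_intervention:.2f}")  # ~0.08
```

Even a sustained intervention, in this stylized picture, moves the needle by only a few points of explained variance; the prior-year score dominates.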
For reform-oriented accountability to work, test scores need to be highly sensitive to what educators do. Instead, we have tests made up of items selected for their ability to consistently sort students, year in and year out, in the same order relative to an increasingly cross-test, cross-year, and even cross-domain psychometric “profile” developed by the testing organizations. (An example would be the location of students, in terms of an ability construct, on a logistic curve.)
These profiles emerge as an artifact of how items are selected. Test developers include in their respective proprietary item pools only those items shown to sort students in the same relative order in terms of their likeliness of getting an item correct. (In other words, ideally for each item in a given area, Student Q should always be more likely to get it right than Student S.) When high-stakes tests are then assembled using only the items that fit with these internal sorting profiles, the tests themselves also end up being remarkably robust in keeping students in the same relative order in terms of their overall scores (Student Q’s overall test score is very likely to be higher than S’s).
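The sorting behavior described above can be sketched with a toy item-response model. The article does not name the testing organizations’ actual model; the sketch below assumes a simple Rasch (one-parameter logistic) model, and the abilities, difficulties, and pool size are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def p_correct(theta, b):
    """Rasch (1PL) model: probability that a student of ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Hypothetical abilities: Student Q sits above Student S on the construct.
theta_q, theta_s = 0.8, -0.3

# An item pool whose items all "behave": difficulty is their only parameter,
# so success probability is monotone in ability for every item.
difficulties = rng.normal(0.0, 1.0, 50)

# Item level: Q is more likely than S to answer every single item correctly.
assert np.all(p_correct(theta_q, difficulties) > p_correct(theta_s, difficulties))

# Test level: across many simulated administrations of a test built from
# this pool, Q's total score lands above S's almost every time.
draws = 2_000
score_q = (rng.random((draws, 50)) < p_correct(theta_q, difficulties)).sum(axis=1)
score_s = (rng.random((draws, 50)) < p_correct(theta_s, difficulties)).sum(axis=1)
frac_order_kept = (score_q > score_s).mean()
print(frac_order_kept)  # fraction of administrations preserving the order
```

Because every admitted item ranks students the same way, the assembled test inherits that ranking: the relative order of students is baked in before any instruction happens.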
Using this approach, test scores will continue to predict other test scores in ways that remain remarkably insensitive to the quality of content-specific instruction. And just one of the unintended consequences of this insensitivity may be that the schools feeling the most pressure to improve test scores will resort to emphasizing test-taking skills, as opposed to meaningful academic content, as a compelling alternative strategy for attaining immediate, if short-lived, results.
Needless to say, these findings are highly problematic for outcome-driven reforms.
A second important lesson the Madoff scandal teaches us is that, for misrepresentation to work at a large scale, people’s desires and, even more so, their fears need to be played to, often by appeals to highly specialized forms of expertise or insider knowledge.
Perhaps no single piece of recent domestic legislation speaks more directly to our hopes and fears as a nation than the No Child Left Behind Act, with its goals to improve both equity and the levels of excellence in education.
The fact, then, that these largely self-confirming testing profiles align so consistently with existing inequities related to socioeconomic status, race, or first language only serves to underscore how problematic our findings are. That the math tests in Texas are now being validated, in the name of predicting “college readiness,” with what historically have been tests of “aptitude” (such as the SAT)—tests that have comparably problematic outcomes along these same dimensions—makes it even more likely that our high-stakes tests in math and science will reinscribe the sorts of inequities the No Child Left Behind legislation was meant to address.
A third lesson Madoff teaches us: If you want to forestall the day of reckoning, make sure you are in charge of both generating and then interpreting your own metrics.
Currently, only a handful of private organizations and companies operating in the United States have large banks of proprietary test items developed, and calibrated, in terms of fit with each of the organization’s own internal statistical profiles. Consequently, only these organizations have the ability to produce tests that can be used to evaluate our movement toward the psychometrically defined goals of the No Child Left Behind law. Test publishers are essential both to ongoing test construction and to the interpretation of the results for nearly all of the high-stakes tests developed in this country.
With affiliates of these same publishers also controlling the lion’s share of the textbook market here in Texas and around the country, one might legitimately begin to wonder how, when it comes to the academic side of schooling (as opposed to school financing), anyone would continue to describe the U.S. education system as locally (or even publicly) controlled.
The fourth Madoff lesson is to surround oneself with true believers. Reputations have to be on the line, and this will make coming to grips with what is really going on that much harder. Some have speculated that even Bernie Madoff, at some early point, might have believed in his own seeming successes.
Those of us deeply involved in reforming science and mathematics education, and who might once have wanted to believe in the potential of testing as a blunt but necessary instrument of reform, must now come to grips with the full implications of the tests’ insensitivity to instruction, which vastly diminishes the role we can hope they will play as instruments of reform. We were wrong to help sell the idea of placing so much trust in institutions that, in retrospect, stood to benefit the most monetarily from our continued willingness to suspend disbelief.
Our professional reputations are indeed on the line, making this perhaps the toughest lesson the collapse of the Madoff empire has to teach about the current state of high-stakes testing. Responsible, rigorous, and transparent alternatives do exist. For us to make accountability work, however, we need to hope the new administration can learn from past mistakes, well before belief in public education’s ability to serve the purposes of a just, economically robust, and democratic society is lost.
As with the economy, in education we can do much better—but only if we learn the lessons for which our children might someday be expected to hold us all accountable.