The National Assessment of Educational Progress is widely viewed as the most accurate and reliable yardstick of U.S. students' academic knowledge.
But when it comes to many of the ways the exam's data are used, researchers have gotten used to gritting their teeth.
Results from the venerable exam are frequently pressed into service to bolster claims about the effect that policies, from test-based accountability to collective bargaining to specific reading and math interventions, have had on student achievement.
While those assertions are compelling, provocative, and possibly even correct, they are also mostly speculative, researchers say. That's because the exam's technical properties make it difficult to use NAEP data to prove cause-and-effect claims about specific policies or instructional interventions.
"It's clearly not NAEP's fault people misuse it, but it happens often enough that I feel compelled to call [such instances] 'misnaepery,'" said Steven M. Glazerman, a senior fellow at Mathematica Policy Research, a Princeton, N.J.-based research and policy-evaluation nonprofit.
"NAEP is so tempting, because it has very wide coverage," he said. "But what it tries to do is actually pretty modest, pretty narrow. And that's a good thing."
Often called "the nation's report card," NAEP represents the achievement of a nationally representative sample of students at three grade levels: 4, 8, and 12. Under the Elementary and Secondary Education Act, each state receiving federal Title I funds also must participate in the exam at the 4th and 8th grade levels in reading and math every two years.
Advocates are fond of making claims about what data from the National Assessment of Educational Progress mean, but not all of them stand up to scrutiny.
Use of Data:
"Public education is supposed to be the great equalizer in America. Yet today the average 12th grade black or Hispanic student has the reading, writing, and math skills of an 8th grade white student."
- From a 2009 Wall Street Journal op-ed written by Joel I. Klein, then the chancellor of the New York City school system, and the Rev. Al Sharpton
Problem:
NAEP scales differ by subject and grade.
Because of that participation requirement, achievement trends can be compared across states, an impossibility with the results of the states' own hodgepodge of exams.
Twenty-one urban districts also volunteer to have their students' results reported through the Trial Urban District Assessment, or TUDA.
NAEP data are generated through a technique known as matrix sampling: each participating student answers only a subset of the questions, so no child takes a "full" exam.
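The core idea of matrix sampling can be sketched in a few lines. The block names and counts below are invented for illustration; NAEP's actual booklet design (balanced incomplete block spiraling) is considerably more elaborate.

```python
import random

random.seed(0)  # deterministic for this illustration

def assign_blocks(num_students, item_blocks, blocks_per_student=2):
    """Give each student a small random subset of item blocks.

    No student sees every block, but across the whole sample every
    block gets administered, so population-level performance can be
    estimated even though no individual takes the full exam.
    """
    return {s: random.sample(item_blocks, blocks_per_student)
            for s in range(num_students)}

# Hypothetical content blocks, not NAEP's real ones.
blocks = ["algebra", "geometry", "measurement", "data", "number-sense"]
forms = assign_blocks(num_students=1000, item_blocks=blocks)

# Every student answers only 2 of the 5 blocks...
assert all(len(subset) == 2 for subset in forms.values())
# ...yet collectively the sample covers all of them.
covered = {blk for subset in forms.values() for blk in subset}
assert covered == set(blocks)
```

The trade-off this design makes is the one the article describes: it yields reliable group-level estimates while producing no complete score for any single student or school.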
Contrasting Assertions
In a sense, what has made NAEP unique in the annals of testing (its commonality and independent administration in an era of cheating scandals) has also rendered it susceptible to misinterpretation and misuse.
"The NAEP exams have good measurement quality and assess subjects other jurisdictions don't have assessment data on," said Sean P. "Jack" Buckley, the commissioner of the U.S. Department of Education's National Center for Education Statistics, which administers the exam. "They are comparable across state lines, which is unusual, and they are well known in the policy world. And unlike trying to negotiate with states and [privacy laws], NAEP data are right there on our website."
The downside is that examples of "misnaepery" are legion.
Use of Data:
"Among these low-performing students [on 2009 NAEP in reading], 49 percent come from low-income families. Even more alarming is the fact that more than 67 percent of all U.S. 4th graders scored 'below proficient,' meaning they are not reading at grade level..."
- From advocacy organization StudentsFirst's website
Problem:
NAEP's definition of "proficient" is based on "challenging" material and is considered harder than grade-level standards.
During the height of implementation of the No Child Left Behind Act, the most recent rewrite of the ESEA, dozens of press releases went out from the Education Department, then headed by Margaret Spellings, attributing gains on NAEP to the effects of the law.
In the District of Columbia, promoters of the policies instituted by former Chancellor Michelle A. Rhee pointed to the district's NAEP gains as evidence that student achievement improved under her watch.
On the other hand, the Broader, Bolder Approach to Education, a coalition housed in the Economic Policy Institute, a left-leaning think tank, drew on the data to support the exact opposite conclusion. And Broader, Bolder's claims that increased access to charter schools, teacher evaluations tied to student test scores, and school closures in the District of Columbia and two other cities didn't lead to improvements for poor and minority students were picked up and repeated by influential education figures.
"The lesson of the new report: Billions spent on high-stakes testing have had minimal to no effect on test scores," a New York University education historian wrote about the paper. "High-stakes testing has failed."
Statistics 101
Use of Data:
"When the market-based policies at the center of the reform agenda play out in a comprehensive manner across many years, the results, as captured in reliable data, are not encouraging. ... Reforms that produce a lack of progress on improving test scores or closing achievement gaps are no different from the 'status quo' that they purport to break."
- From "Market-Oriented Reforms' Rhetoric Trumps Reality," by the Broader, Bolder Approach to Education coalition
"In Charlotte, N.C., and Austin, Texas, both cities in right-to-work states where collective bargaining is not required, students in 4th and 8th grade are performing higher than the national average in both reading and math."
- From "Collective Bargaining and Student Academic Achievement," by the American Action Forum
Problem:
Both statements imply that specific policies affected scores, but causal conclusions are difficult to validate using NAEP.
Most such claims suffer, researchers say, from failing to consider that a correlation or relationship between two points of data does not prove causation.
"They're committing the fundamental and almost inexcusable error of leaping to the causal conclusion they prefer, when hundreds of others are possible," said Grover M. "Russ" Whitehurst, the director of the Brown Center for Education Policy at the Brookings Institution and a former director of the Education Department's research wing.
Another spurious use: treating NAEP data as though they track the same students' progress through school. Such longitudinal data generated from state tests are frequently used by statistical researchers, who can take into account students' background characteristics to control for the effect of poverty or family education on scores.
But NAEP data represent repeated cross-sectional snapshots of achievement, not the progress of individuals, making it more challenging to institute such controls.
"I can understand why people think if test scores go up, it's because schools get better," said Matthew Di Carlo, a senior fellow who writes about education research for the Albert Shanker Institute, a think tank affiliated with the American Federation of Teachers. But with NAEP, "you're comparing two different groups of students and assuming they're not changing over time."
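Mr. Di Carlo's point can be shown with a toy calculation. The scores below are invented; the point is structural: an average can rise between two cross-sectional snapshots even though no individual student was followed, let alone improved.

```python
# Invented scores for two different groups of 4th graders, tested
# four years apart. NAEP reports snapshots like these; it does not
# follow individual students over time.
cohort_a = [210, 215, 220, 260, 265]  # students tested in year 1
cohort_b = [240, 245, 250, 255, 260]  # a DIFFERENT group, year 2

mean_a = sum(cohort_a) / len(cohort_a)  # 234.0
mean_b = sum(cohort_b) / len(cohort_b)  # 250.0

# The average jumped 16 points, yet no student was tracked:
# the shift could reflect changes in who was tested (demographics,
# enrollment, exclusion rates) rather than schools getting better.
print(mean_b - mean_a)  # 16.0
```

Because the two groups are different people, the kinds of student-level statistical controls researchers apply to longitudinal state-test data cannot be applied the same way here.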
Use of Data:
"We subtracted the percentage of students in the state who scored proficient or better from the state NCLB test from the percentage of students in that state who passed the NAEP, and used this difference (or gap) to align each school and district test scores across the nation."
- From real estate website NeighborhoodScout
Problem:
NAEP cannot be used to generate comparable school results.
Claims compiled by Stephen Sawchuk.
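The alignment arithmetic NeighborhoodScout describes can be sketched with invented figures; the sketch also shows why a single statewide offset cannot make school-level results comparable.

```python
def statewide_gap(pct_proficient_state_test, pct_proficient_naep):
    """The single statewide offset the quote describes: state-test
    proficiency rate minus NAEP proficiency rate."""
    return pct_proficient_state_test - pct_proficient_naep

# Hypothetical state: 80 percent proficient on its own test,
# 35 percent proficient on NAEP (invented numbers).
gap = statewide_gap(80, 35)
assert gap == 45  # one offset for every school in the state

# Applying that one number to each school assumes the difficulty
# difference between the two exams is uniform across schools. But
# NAEP tests only a sample of students, is not administered in
# every school, and defines "proficient" against its own standard,
# so the "adjusted" school-level figures are not actually comparable.
```

The offset is a real number that can be computed; the flaw is in what it is then used to claim, which is the pattern of "misnaepery" the article describes.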
Some misuse occurs entirely outside of policy contexts.
"The states see this happening more than even we do nationally," said Cornelia Orr, the executive director of the National Assessment Governing Board, the body that sets policy for NAEP.
For instance, she said, "they're concerned about real estate companies and how they abuse their own state test data, and they're concerned it will happen with NAEP."
New Techniques?
The issue has been sufficiently worrisome that a joint task force of NAGB and the Council of Chief State School Officers began to catalog it in 2009-10.
Scholars say that it is possible to conduct high-quality studies using NAEP data, but doing so appropriately requires research expertise beyond what most lobbyists and policy analysts possess.
"NAEP is just an outcome measure. It's no different from an IQ test or the number of teachers with advanced degrees," Mr. Whitehurst said. "The ability to draw causal inferences about any education variable depends not on NAEP, but on the quality of the research design for which NAEP is the outcome measure."
High-quality studies have drawn on NAEP results as an outcome measure to estimate the effects of particular policies, Mr. Whitehurst noted.
Mr. Glazerman cautioned, though, that such studies are few and far between.
Advocates' desire to seek quick confirmation for their policy prescriptions, especially when those prescriptions are gaining or losing momentum, means that it's unlikely that interest in using NAEP for policy analysis will end anytime soon.
"There is just this unwillingness to accept that policy analysis is difficult, takes a long time, and often fails to come to strong conclusions about individual policies," said Mr. Di Carlo.
Over time, the difficulties inherent in interpreting NAEP results have even posed challenges for NCES and NAGB, which must weigh how to report and disseminate data from the exam to minimize misinterpretations.
The NCES itself has on occasion produced reports that include correlations, Mr. Buckley noted. Even when accompanied with caveats, he said, they have been misinterpreted in press accounts.
Still, Mr. Buckley said, the benefits of NAEP data far outweigh the harm that accompanies ill usage.
"We're not the country's education data police," he said of the NCES. "We want the data to be useful, and we trust that the marketplace of ideas will drive the bad uses out."