Education Week

Special Report
Education

How High the Bar?

By Debra Viadero | January 11, 2001 | 14 min read

In Holliston, Mass., a middle-class, college-minded suburb west of Boston, students are accustomed to taking standardized tests. And teachers like Kenneth L. Worsley, a longtime math instructor at Holliston High, usually review the test scores to get a handle on their students' year-to-year progress.

But when results from Massachusetts' tough new state exams began trickling in a little more than a year ago, it became apparent to Worsley that there was nothing usual about the state testing program.

Poring over the numbers, Worsley noticed that one 8th grader who was rated "advanced" in mathematics on the Stanford Achievement Test-9th Edition, the nationally normed test that Holliston students ordinarily took, fell into the "failure" category on the math portion of the state exam. Two others, both strong performers in the classroom, dropped from "advanced" on the national test to "needs improvement" on the Massachusetts test.

In all, 35 percent of the school system's 8th graders had achieved advanced status on the Stanford-9. Yet, a few months later, only 3 percent of the same group of students had earned the same label on the state test.

"What a test!" Worsley wrote in an e-mail to colleagues around the state. How, he wondered, could two math tests given to the same group of students in such a short time span produce such different results?

Many explanations probably exist for the test-score differences the Massachusetts teacher noticed. The easiest is to dismiss the discrepancies as evidence that the two tests, though meant for students at the same grade level, do not cover identical material. But experts say the varied results may also illustrate something else: What is meant by "high standards" is largely in the eye of the beholder.

Across the country, states have been moving to hold schools, students, and educators accountable for their performance. As they do so, states must decide how high to set the bar and how fast they can expect students and schools to improve.

Although established technical procedures exist for determining passing scores and the like, the final decision about where states set the academic bar in their accountability programs is, in the end, a judgment call. Someone--usually several groups of people--has to pin down what to test and how difficult to make test items, where to set cutoff scores for who passes and who gets labeled "proficient" or "in need of improvement," how much test-score improvement is realistic to demand from schools, and how long it should take to get there.

"Regardless of the process, it's always a decision of judgment, and people believe there's an exact science to creating a cut-score point, but there's not," says Catherine L. Horn, a research associate for the National Board on Educational Testing and Public Policy, based at Boston College.

But as increasing numbers of states add teeth to their testing programs, such decisions are becoming critical. In Kenneth Worsley's state, for example, a few scale-score points can mean the difference between graduating on time and putting in another year of school for students in the class of 2003. And making sure that academic expectations are high, yet realistically attainable, can mean the difference between the ultimate success or failure of a state's system of standards and testing.

Math teacher Kenneth L. Worsley was perplexed when his 8th graders who scored in the "advanced" category on one national exam failed or were deemed "in need of improvement" on the Massachusetts test.

"It's hard to know in the scheme of setting expectations what's the right thing to do," says Brian M. Stecher, a senior social scientist for the RAND Corp., a Santa Monica, Calif.-based think tank. "If you set them too low, that's not going to lead to closing the achievement gap. If you set them too high, you're encouraging people to 'game' the system."

Discrepancies Across Exams

In building the Massachusetts Comprehensive Assessment System, known as MCAS, policymakers staked out the high end of the academic-challenge spectrum.

"It's a very, very hard test," says Worsley of the Holliston district, which enrolls about 3,000 students. "I don't know how the inner-city schools in our state are ever going to get there."

The idea was to make the state's academic standards and its schools "world class."

"It's something to stretch for rather than something that simply validates the existing curriculum," says James A. Peyser, the chairman of the state school board.

Worsley was not the only educator in the state to notice that there were sometimes big differences in how students ranked on the MCAS tests and the labels they were given on other measures.

Horn and her colleagues from the Boston College testing center did a similar, though more sophisticated, analysis using scores from four Massachusetts districts. In all four, students had also taken at least one standardized test in addition to the state tests in the same year. Those tests included the Stanford-9, Educational Records Bureau exams, and the Preliminary SAT.

Looking first at students鈥 scale scores on the exams, the researchers found few surprises. As might be expected, students who did well on the state tests also scored high on the other measures.

But the Massachusetts tests also assign students labels based on where they fall along an 80-point continuum. A scale score of 200 to 220, for example, signifies failure, while a student scoring 221 is classified "in need of improvement." Higher scorers are labeled either "proficient" or "advanced."
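The collapsing of an 80-point scale into four labels can be sketched in a few lines of Python. Only the 220/221 boundary comes from the article; the upper cutoffs of 240 and 260, and the function name itself, are illustrative assumptions, not the state's published values.

```python
def performance_level(scale_score):
    """Map an MCAS-style scale score (200-280) to one of four labels.

    The failure/needs-improvement boundary at 220/221 is from the
    article; the 240 and 260 cutoffs are assumed for illustration.
    """
    if not 200 <= scale_score <= 280:
        raise ValueError("scale scores run from 200 to 280")
    if scale_score <= 220:
        return "failing"
    if scale_score < 240:
        return "needs improvement"
    if scale_score < 260:
        return "proficient"
    return "advanced"
```

A one-point difference at a boundary (220 versus 221) flips the label, which is the researchers' point about how much information the four categories throw away.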

The problem was that students scoring at advanced levels on the comparison tests wound up in all four categories on the MCAS--much as Worsley's students did.

"When you take an 80-point continuum and reduce it to four points, that becomes problematic," Horn says. "And, at least in Massachusetts, the focus is really not on scale scores. It's on performance levels because they're so easy."

But Massachusetts officials say the more important point to keep in mind is that all the tests studied were highly correlated, because the top performers on the state test also did well, for the most part, on the other tests.

"Just because one student scores high on the [Stanford-9] but does poorly on the MCAS doesn't tell you anything about correlation," says Peyser.

What's more, he notes, the two are completely different kinds of tests. "One is a criterion-referenced test based on published standards, and one is a norm-referenced test based on no published standards. If they were the same tests, we'd be wasting our money on MCAS when we could buy the Stanford tests off the shelf," Peyser says.

While the variation found in the test scores may give the impression that policymakers were pulling the cutoff scores for their tests out of thin air, that was hardly the case. Like most states with student-testing programs, Massachusetts used some well-established test-development procedures to determine where to draw academic distinctions.

"Almost all of these tests will get challenged in court," says P. Uri Treisman, who, as the director of the Charles A. Dana Center at the University of Texas at Austin, has watched Texas' accountability system evolve. "A meaningful part of court proceedings, something all state agency people know, is that you have to have your psychometrics together"--meaning the statistical underpinnings of the test design are sound.

For its part, Massachusetts used a performance-level-setting procedure known as the "booklet classification" method. In every subject area, 20-member panels made up of teachers, administrators, and community representatives spent two days reviewing examples of student responses to test questions, says Jeffrey M. Nellhaus, the state testing director.

Their task was to decide whether the work in the test booklet represented a minimal understanding of the content tested, partial mastery of the material, solid understanding, or comprehensive, in-depth understanding.

"Each panelist classified the same set of booklets," Nellhaus explained. "Since we know the raw scores on the booklets, we could then establish the cut scores." The state board, in turn, adopted the panels' recommendations.

High Failure Rates

To the south of Massachusetts, Virginia--which has also set its academic goals high--relied on a 30-year-old method known as the "Modified Angoff" procedure to set the passing scores for its new state tests.

That approach centered on 20-member, geographically balanced committees that included teachers, curriculum experts, and school administrators. Committee members were shown test items and asked to judge the probability that a minimally competent person would get each right. By averaging those verdicts, the committees came up with ranges of passing scores, which were sent on as recommendations to the state board of education.
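The averaging step described above is simple arithmetic: each panelist's item-by-item probabilities sum to an expected raw score for a minimally competent examinee, and those expected scores are averaged across the panel. A minimal sketch (the function name and sample numbers are invented for illustration, not Virginia's actual data):

```python
def angoff_passing_score(ratings):
    """Recommend a raw passing score from Modified Angoff ratings.

    ratings[p][i] is panelist p's estimated probability that a
    minimally competent examinee answers item i correctly.
    Each panelist's item probabilities sum to an expected raw score;
    the recommended cut is the average of those sums.
    """
    per_panelist = [sum(item_probs) for item_probs in ratings]
    return sum(per_panelist) / len(per_panelist)

# Two hypothetical panelists rating a three-item test:
cut = angoff_passing_score([[0.6, 0.8, 0.5],
                            [0.7, 0.9, 0.4]])
```

In practice the panel's output was a range of passing scores forwarded as a recommendation, so a real implementation would report the spread across panelists, not just the mean.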

Most of the time, the Virginia state school board chose scores from the high end of the range. In two cases, it exceeded committee recommendations.

The result in both Massachusetts and Virginia was a testing program with some very high hurdles for either schools or students to jump over.

In 1998, when the Massachusetts test results were reported for the first time, 81 percent of 4th graders were either failing or in need of improvement on the English/language arts exam; 71 percent of 8th graders fared just as poorly on the science/technology tests; and 74 percent of 10th graders got failing or needs-improvement ratings on the math test.

In Virginia, 98 percent of schools were given failing marks on the first administration of the Virginia Standards of Learning tests in 1998. But on some of the tests, such as 8th grade science, as many as 71 percent of individual students were earning passing grades that year. Last year, the percentage of individual students passing the tests ranged from 39 percent in 10th grade U.S. history to 85 percent in writing.

The number of schools labeled "accredited with warning"--the lowest possible grade on the tests--dropped to 12.8 percent.

The initially high failure rates prompted protests against the testing programs in both states. In Massachusetts, teachers said they were worried about the potentially harmful effect of describing so many students as somehow deficient--particularly minority students who have traditionally scored lower on standardized tests.

"My concern with labeling students so young is, will we have increased dropout rates?" Worsley remarks.

Hundreds of students, most of them from suburban western Massachusetts, boycotted the MCAS tests altogether last spring. The protesters represented only a small fraction of the 220,000 4th, 8th, and 10th graders scheduled to take the tests, however.

Still, the protests have not disappeared. On this past Election Day, for example, voters in six, mostly urban districts approved a nonbinding resolution to suspend plans to use the test as a graduation requirement.

In Virginia, where penalties for poor performance are still years away, protests were more muted. But the state school board last July extended from 2001 to 2004 the date by which students will have to pass the state tests to graduate. Schools now have until 2007, rather than 2004, to get their students鈥 passing rates up to 70 percent in order to avoid losing their state accreditation.

"You can't set the bar and expect everybody to jump over it in the same period of time with the same basic instruction," says William C. Bosher Jr., Virginia's former state superintendent. "I believe the Virginia board of education is making adjustments that will enable the time to be flexible while not forgoing the standards."

Inch by Inch

But the transition to higher academic standards might be less painful, some observers have argued, if state policymakers set a lower academic bar in the beginning.

"There's an axiom that if your constituents can't meet a requirement, you'd better not pass it into law," says Treisman of the Dana Center in Texas. "Some legislators are sensitive to that, and others set the standards so high as to violate that axiom."

In contrast, he says, Texas policymakers in 1993 set low passing scores for the Texas Assessment of Academic Skills and then raised the bar, inch by inch. State school officials notify districts of the upcoming changes to the testing program up to two years in advance.

"The fact that the state was able to set passing standards and ratchet them up five points every year was the genius of the system," Treisman says. Even with rising standards, overall student passing rates on the tests have increased from 53 percent in 1994 to 80 percent last spring, with some of the biggest gains coming among minority students. (In the Texas system, schools have to demonstrate that learning is improving for their minority populations as well as for their entire enrollments.)

But that system, known as the TAAS, has had its share of detractors, too. Hispanic and black students who had failed the state's high school exit test brought an unsuccessful lawsuit against the state in 1999. Citing passing rates that were two-thirds those for white students, minority groups contended that the testing program was unfairly stacked against them.

Some Texas teachers, meanwhile, have argued that the push to do well on the tests is effectively narrowing the curriculum for all students.

"No one state has gotten all of this right straight out of the chute," says Jim Watts, the vice president for state services for the Southern Regional Education Board, an Atlanta-based group that promotes school improvement in 23 states. "To some extent, it doesn't get real until it's real."

Policymakers in Kentucky, for example, overhauled that state's 6-year-old accountability system in 1998, going so far as to replace some tests. Now, schools have until 2014 to reach a score of 100 on an index that is based on improving dropout and retention rates as well as test scores.

Under the old system, schools were given test-score targets to meet in 20 years. They were expected to reach one-tenth of the distance toward their targets every two years, and rewards and punishments were meted out based on their progress. High-achieving schools complained they were unfairly penalized because they were topping out on the tests.

The new system sets the same target of 100 for everyone and plots a growth line for schools to follow as they move toward that goal. Schools are designated to be "in reward" or "in assistance" based on how far above or below that line they fall.

Whether 16 years or 20, the target date is an arbitrary number, supporters and critics agree.

"It's a policy decision made by the legislature based on what's reasonable and what do taxpayers think is reasonable," says Robert F. Sexton, the executive director of the Prichard Committee for Academic Excellence, a citizens' group that promotes school reform in Kentucky.

Even that long timeline, however, is too short, say some superintendents. "It's our belief that only 30 percent of the schools here in Kentucky can make that mark," says Stephen Daeschner, the superintendent in Jefferson County. With 96,000 students and the city of Louisville in its domain, his district is the state's most urban. "It's not realistic," he contends.

State officials, for their part, say it's too soon to tell whether Daeschner's projections have any merit because the new system is just getting under way.

Achievement Underestimated

In contrast, California school officials may have underestimated the number of schools that would qualify last year as having raised their achievement-test scores. State officials had predicted that educators in 60 percent of schools would be entitled to bonuses of up to $25,000 each because of their schools' gains. But when scores were calculated last October, two-thirds of schools--67 percent--had met their target goals.

The new program is primarily based on results from the Stanford-9. Schools in 1999 were each given a baseline score, and their improvement targets were set at 5 percent of the difference between that starting number and a statewide performance target of 800.
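The target formula above amounts to one line of arithmetic: a school's annual growth target is 5 percent of the gap between its baseline and the statewide goal of 800. A minimal sketch, with a hypothetical function name:

```python
STATE_TARGET = 800  # statewide performance target cited in the article
GROWTH_RATE = 0.05  # the legislature's 5 percent figure

def growth_target(baseline):
    """Annual improvement target for a school under California's rule:
    5 percent of the distance from its baseline to the 800 goal."""
    return GROWTH_RATE * (STATE_TARGET - baseline)

# A hypothetical school with a baseline of 600 must gain about
# 10 points; one starting at 790 needs only a fraction of a point.
low_school = growth_target(600)
high_school = growth_target(790)
```

Because the gap shrinks as a school improves, the rule automatically demands the largest absolute gains from the lowest-scoring schools, which is the incentive structure Stecher describes below.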

The 800 target is an interim number based on data from the Stanford publisher projecting that 10 percent of students across the state could score at that level. "People just sort of accepted the fact that if you're in the top 10 percent of anything, that's probably a good thing," says William L. Padia, the director of the office of policy and evaluation for the California education department.

State lawmakers came up with the 5 percent figure for the improvement target; expert panels were appointed to figure out 5 percent of what. Should the formula be 5 percent of the previous year鈥檚 test scores? Or 5 percent of the average test-score growth across the state?

"They attempted to balance a number of things," Stecher of RAND says. "They wanted to put the greatest incentive on the schools doing the poorest, and the 5 percent of the distance to the target metric does that."

It also helped, says Padia, that state school board members knew, in adopting their new targets, that the legislature had given them the authority to make changes later as the testing program evolves.

Experts predict that such adjustments, in fact, will inevitably occur in most states.

If setting cutoff scores for performance levels is an inexact science, adds Stecher, determining what kinds of academic-growth expectations are reasonable for schools can be pretty close to an educated guess.

"We have a lot of history now with regard to setting passing rates for licensing exams or professional certification," he says. "What we have less history on is setting standards for gains or improvements."

But to policymakers, the bigger mistake would be to have no goals at all to which students and schools could aspire.

"You may never get there if you don't set them high," Sexton of Kentucky's Prichard Committee says. "You can give us every research-based argument as to why that can't happen, and we will ignore them all because we will not go to the public and say we can't educate your child."


A version of this article appeared in the January 11, 2001 edition of Education Week.
