The Case for (Smart) Standardized Testing

These remarks are adapted from the keynote address by Chris Cerf, the former New Jersey state commissioner of education, on the status of school reform, particularly the role of standardized testing in school improvement, at a recent convening on measuring school performance hosted by the MIT School Effectiveness and Inequality Initiative. Cerf has also served as deputy chancellor of the New York City Department of Education under Mayor Michael Bloomberg, as superintendent of the Newark Public School District, and in leadership roles in several private-sector companies.

We live in a highly polarized world, a world where shades of gray and nuance have been replaced by stark divides—good or bad, right or wrong, enemy or friend. We see it everywhere from national politics to culture wars and in what some have decried as the end of any pretense of civility in public discourse.

If this polarization characterizes where we are today, the canary in the coal mine has been the debates about “school reform” that have raged over the past two, even three decades.

Are you “for” public education or are you a “privatizer?” Do you “support” teachers or do you “blame” them? Are you “for” tests and accountability or are they evils to be challenged at every turn? Do you “support” charter schools or “oppose” them? Are you “for” local control or do you support centrally mandated academic standards?

Every one of these of course is a false choice. The answer to each—as it is to most hard questions—is “it depends” on the particulars of the idea as applied to the specific circumstances in question.

But that is not how the debate is usually framed today. Indeed, in the blog wars and rhetorical excesses that dominate the press, the word “reformer” has become a malediction, a label dripping with opprobrium.

I find this remarkable given the vast and immoral inequities that exist today in educational outcomes. If there is nothing else we can find common ground on, can we not agree that our central national value is that birth circumstances—demographic, economic, or otherwise—should not determine life outcomes; that free, public education exists in no small part to equalize opportunity; and that, by any responsible measure, we as a nation have failed to fulfill that promise?

If we can agree on this, how can striving to reform a system that is not delivering on its core purpose be anything but a laudable goal?

I have spent most of my career in service of this goal. The teams I had the privilege to work with organized their reform efforts around four principal drivers.

First, our philosophy was to be governance-indifferent and quality-focused. In both Newark and New York, for example, we closed or restarted numerous chronically failing schools and launched a number of both traditional and charter public schools, which among other things led to two of the top-performing charter sectors in the country.

Second, we paired this with a deep belief in empowering parents. Every family in Newark had the opportunity to select the public school, traditional or charter, anywhere in the city that best met their child’s needs.

Third, we adopted an intentional and disciplined focus on educator effectiveness. Again, Newark illustrates the point. We consistently identified around 15 percent of the city’s educators as below effective and brought nearly 250 tenure charges—when in the prior decade tenure charges could be counted on the fingers of one hand.

Finally, we took curriculum seriously, instituting research-validated, Common Core-aligned reading and math programs and training teachers relentlessly on them.

These strategies made a significant difference in the lives of hundreds of thousands of children.

In New York City, study after study has shown that the body of work in the Bloomberg/Klein years yielded significant and sustained gains—despite subsequent efforts to dismantle it.

In New Jersey, we stuck to our guns on high standards, meaningful assessments, tenure reform, and accountability in both the traditional public school and charter public school sector. We also accepted responsibility, often at great political cost, for addressing the systemic and chronic failure in communities such as Camden and Newark. The results speak for themselves.

In 2017, New Jersey’s NAEP performance in 4th grade reading and math was tied for the highest in the nation and in second place in 8th grade. And, despite the small dip experienced across the country last year, New Jersey remains a top performer on every relevant metric.

New Jersey has achieved especially noteworthy results in some of our highest needs communities. In Camden, the governor’s courageous decision to take control of the district yielded impressive results. And in Newark, perhaps in its time the most politically charged education reform effort in the country, the work was an unambiguous success.

That may come as a surprise to those who form their education views by reading the New Yorker, but under state control, graduation rates rose 20 percentage points to just under 80 percent. Between 2010 and 2018, the percentage of African Americans who attended schools that beat the state average tripled. According to the Center for Reinventing Public Education, Newark now has more “beat the odds schools” than any city in the country.

On the rigorous PARCC exam, Newark’s Free or Reduced-Price Lunch students significantly outscored their counterparts in ELA not only in New Jersey but also in every other PARCC state, including Colorado, New Mexico, Illinois, Maryland and Rhode Island. The percentage of students meeting or exceeding expectations was more than double that of the District of Columbia. Newark students made similar gains in math.

While I am proud of the successes our teams achieved, there is also an ample supply of disappointments and failures.

An Elusive Consensus

High among the latter is the failure to forge a consensus around the central topic of this gathering: “How to Measure and Improve School Performance.”

Throughout my various posts I have always thought certain propositions related to accountability and measurement were so obvious, so beyond debate that one could reasonably assume that they were universally held:

That which is measured is what is done. If you care whether children can read at a certain level of complexity, for example, it follows that you should track to what degree that is in fact happening and, by so doing, behavior on the ground will organize around that goal—at the school, classroom, and student levels.

If you don’t know where you are trying to get, it is very unlikely you will get to where you want to go. In other words, the first task is to define what you mean by success and build your plans and strategies around those objectives.

Measuring success directly is usually better than trying to predict success via a proxy. Using reading again as an example, I have no doubt that reading is a directly measurable skill (even as I applaud the work in Louisiana and elsewhere to make the evaluation tool fairer and more meaningful). So rather than measuring things that might predict reading progress, measure it directly.

Transparency and information are positive goods that encourage intelligent decision-making, whether by a parent selecting a school, a teacher designing a lesson plan or a policymaker deciding what strategies to pursue.

Measurement, accountability and data-based decision-making in general can be fetishized to a point that their costs outweigh their benefits. Accordingly, great care must be taken to find the right balance—neither abandoning measurement and accountability altogether nor allowing them to become “counter-educational” in their application.

As it turns out, I was quite wrong that these propositions are universally embraced. Very much to the contrary. There is anything but a consensus on all of them.

This is most evident in the often-virulent anti-testing and opt-out movements. It is also evident in the absolutist position of the national teachers unions that there is no metric-based methodology for differentiating teacher effectiveness that is both fair and reliable—and never could be.

But it is deeper than that. The notion that “testing is bad” is now firmly embedded in the American psyche. And, thanks to the press’s perseverative and uncritical repetition, the phrase “teaching to the test” is accepted as the embodiment of poor educational practice.

Before I offer my perspective on why this anti-testing sentiment has gathered such momentum, let me make two points:

First, teachers invented tests. Anyone out there who says you can have high expectations for students and not have some way of measuring whether those expectations have been met has never spent a minute in a classroom. Effective use of assessments is a critical part of the teaching and learning process.

Second, can we please scrap the phrase “teaching to the test?” For four years, I taught what at the time was called AP U.S. History and Government, which as you know culminates in a several-hour exam drafted not locally but by a national organization, the College Board. I knew that at the end of the year my students would be asked to read a set of primary sources and to write a coherent, original and persuasive essay demonstrating their understanding of those documents in the context of a broader understanding of a particular historical era, the Depression, for example, or the origins of the Cold War. And that is what I taught them how to do.

You can malign that by calling it “teaching to the test,” but most educators would view it differently: as the very essence of quality teaching. Drill and kill is counter to sound educational practice. So is teaching test-taking techniques. But “teaching to the test,” if done correctly, is a positive and necessary element of the “standards” movement, perhaps the single greatest engine of educational progress as evidenced by its adoption by every economically advanced nation on earth.

The Anti-Testing Movement

So why has the virulent anti-testing movement achieved such prominence? I think there are four reasons.

First, we do not have a common definition of success in public education. If we don’t agree on what counts, it’s pretty hard to agree on how to count it.

Second, measurement and accountability, by definition, identify a hierarchy of success, which generates fierce opposition from those who fear being placed on the “low” end of the continuum.

Third, anti-testing extremists are prone to mischaracterizing the entirely valid concerns about testing and measurement advanced by policy makers and thoughtful critics who in fact are proposing responsible and balanced solutions.

Fourth, there is a tendency among scholars and policy makers alike to let an idealized “best” be the enemy of the “good.”

Imagine how simple a world we would live in if we had a national consensus about the floor to which we would hold all public schools. What if it were as simple as defining success as follows: By the time students graduate from high school, they are literate and numerate at a level commensurate with post-secondary readiness (the so-called ‘gateway skills’). Sure, we would also require that they take a high-quality, performance-oriented science course each year, that they be exposed to computer science, that they have three years of civics and history, and that they have the opportunity to take art in one or more of its many forms. That’s it. Everything else, including what and how to test, would be left to local decision-making and innovation.

States of course would continue to set standards. Local authorities, educators and parents, however, could choose to measure and report whatever information they thought most useful, and one hopes they would choose wisely based on research and evidence.

Of course, states, which usually have superior data and analytic capacity, could continue to feed districts a wealth of information and could message the negative consequences of excessive district-mandated testing. But nothing except reading and math would be the subject of state-wide mandatory annual testing and reporting.

If we had a national consensus about that, the decision to evaluate ELA and math via annual tests, I hypothesize, would be as uncontroversial as it is in most economically advanced countries. And that would almost certainly be the case, as I will briefly touch on in a moment, if we did not use that data to evaluate individual teachers.

But no such consensus exists, and therefore the policy debate ricochets between the extremes of testing everything or nothing. SEL? Science? Art? Civic engagement? Values? Safety? Predictives? Unit tests? Final exams? Surveys? Multi-factor school report cards? Dashboards? And on and on.

Would reading and math continue to receive disproportionate focus? Probably, as well they should in some schools for some students, since these two disciplines unquestionably form the foundations for future post-secondary success. But there would be much more room for flexibility and far less pressure to relentlessly test and evaluate everything under the sun.

Our lack of a common definition of success, however, goes well beyond the absence of a consensus about what to test. More fundamentally, the nation is deeply divided on the very purpose of public education.

Whatever else one thinks about NCLB, there is great meaning in the name of the law: No Child Left Behind. That is a radical, and indeed completely ahistorical concept.

Over the decades, really since the middle of the 19th century, the purpose of public education has been all over the map: to impart democratic values; to facilitate the melting pot; to educate the sons and daughters of the master class to take the reins of industry and government, while educating working-class children to a level that would enable them to succeed in their assigned role fueling the engines of industrial production. And more.

A More Radical Purpose

NCLB stated a different and more radical purpose, rooted in the ideal of equal opportunity that is enshrined in the nation’s founding documents: All children, regardless of birth circumstances, should be held to the same high standards, and we should regularly evaluate the degree to which they are achieving that goal.

To be blunt, one critical reason testing and measurement have generated such controversy is that broad swaths of Americans simply do not accept that premise, even if they pay lip service to it.

Low expectations, often with racist undertones, are epidemic: Often in whispered voice, but with alarming frequency, one hears, “These tests are ‘too hard,’ the goals too lofty to be realistic for many disadvantaged students.” And further, “The only way they (and derivatively, ‘we’) can succeed on them is to game the system with drill and kill, excessive test prep or worse.”

Of course, that is simply not true, as any number of counter examples show. And such a perspective is unquestionably tantamount to putting up a white flag of surrender to class and race differentiation, an unconscionable outcome. But if you believe some version of the view that “these tests are too hard for ‘these’ kids,” the easiest target is high standards and the tests that evaluate whether they have been met for all children.

And there is an inverse of this phenomenon for those who are at (or who are educating) the more affluent end of the socio-economic continuum. If you are a family of means that is not especially invested in the educational fortunes of the disadvantaged, and your child comes well loaded for success pretty much regardless of what his or her school dishes up, then this whole standards and accountability dance seems superfluous, distracting, and loaded with opportunity costs in the form of instructional time spent on tests. Worse yet, the tests may yield objective evidence that the school district you paid a fortune to live in isn’t as good as it is cracked up to be or, heaven forbid, that your little Johnny or Mary isn’t as gifted as you had imagined. And who wants to hear that?

This absence of a national consensus around what constitutes success suggests a second, related reason why the “obvious” principles related to assessment enumerated above are far from universally embraced. Measurement and accountability, by definition, identify a hierarchy of success, and that generates fierce opposition from those who fear being placed on the low end of the continuum.

Certainly, the best example of this is the by-now received wisdom that “reformers” (including me) made a mistake by simultaneously advancing higher, college-aligned standards and pushing to use classroom- or school-specific achievement data as a component of determining educator efficacy. Whatever one’s views on the merits, there can be no doubt that this coupling generated a shrill, well-funded and largely successful attack on testing from the national teachers unions.

A third reason that common sense approaches to testing and evaluation have met with such opposition is a reflection of the epidemic of polarization that is especially pervasive in the school reform arena. 

There are of course entirely valid concerns about over-testing and relentless measurement. Indeed, that is one of the “obvious” principles set out above. When thoughtful critics suggest responsible approaches to rethinking our current approach to test-centered accountability, however, the anti-testing lobby often overstates their views in support of their more radical agenda—making it that much harder to undertake a serious effort to achieve a balanced solution.

Finally, we sometimes forget that all tests are imperfect. That reality has led to a tendency to let an idealized “best” be the enemy of the “good”—to reject testing because of its imperfections in the hope that some perfect alternative will let us have it all—while conveniently forgetting that a world without tests, even imperfect ones, has devastating consequences, especially for disadvantaged students, who are disproportionately children of color.

By way of example, here’s a testing regimen I know we could all endorse: A rich curriculum with embedded pedagogy integrated into a digital platform that quietly generates a “data exhaust” that validly and reliably produces a summative view about student progress. No tests, just good teaching accompanied by useful conclusions. Sound pretty good? How about we discover a handful of soft survey questions that, in similar fashion, validly and reliably serve as a surrogate for traditional summative assessments?

Great stuff, right? The problem is they don’t exist and there is not a lot of evidence that they will in the foreseeable future.

The inherent imperfections of our current tests and accountability systems demand continued refinement and extensive research and development. Let’s not forget, however, that the standards movement at its heart is a civil-rights movement. It seeks to assure that when one walks into an Algebra One class in the most disadvantaged school in Newark, the “expectations floor” is as high as in any school in the state. For many decades it absolutely was not: race and poverty resulted in a different and lower floor, an outcome our central national ideal compels us to reject.

If we are going to have high and equal standards, we must reject the facially absurd notion that such standards can exist without a mechanism to evaluate whether they are in fact being met. The “best” may not exist. The “good” is inherently imperfect. But abandoning evaluation altogether is the worst solution of all.

The Way Forward

Going forward I would suggest that we align our work around five principles.

First, reject the extreme dogma that tests are “bad.” Evaluating the degree to which children are progressing towards the high standards they need to succeed in life is not an evil to be avoided but a positive element of any effective public education system.

Second, less is more. Require students to take a diverse array of courses and allow districts to measure and report whatever they think most useful, but limit annual state- or federally mandated assessments to reading and math. The decision to evaluate everything else, from SEL measures to science to civics, should be left to the discretion of each school community, which should be admonished to be very thoughtful about the overuse of assessments.

The testing critics are right that over the past decade or so the focus on testing, data, and evaluations overwhelmed other valid educational values. District officials and educators should make a conscious effort to seek greater balance, by limiting the number and breadth of assessments.

Third, information is meaningless if it isn’t used to change behavior. I am not a fan of sampling techniques that only generate aggregate data. It’s nice to know if District X is making progress relative to other districts, but at a minimum, we should know every year if a school, a grade within a school, or a demographic subset of students is lagging.

Fourth, and perhaps most controversially, I would be very tempted to give up on using student test data as a component of teacher evaluation. I’ve come full circle on this and have concluded that the costs outweigh the benefits. I would, however, continue to use school-wide test performance as one factor in decisions about school leadership, school closure or restructuring and “state takeover.”

Lastly, we should continue to invest heavily in developing improved mechanisms for generating information that can be deployed by educators and parents alike to maximize the prospects of equal educational opportunity for all students.

There is no denying that the arc of the school-reform pendulum has retreated from its high point during the Obama/Duncan years. I have little doubt, however, that it will swing back, albeit with important refinements born of lessons learned. Those refinements, without question, will include improvements in our testing and accountability systems, which have often assumed an outsized and at times distortive role in the life of our schools.

My entreaty is that in pursuing those refinements we not abandon the central principle that all children should be held to high and equal standards. Especially in the gateway skills of reading and math, let us also not forget that such standards, unless accompanied by student-specific evaluation of whether they are in fact being met, are an empty promise.