Student achievement is typically reported as an aggregate test score, obscuring how students perform across individual concepts and skills. A new study by Jesse Bruhn, Michael Gilraine, Jens Ludwig, and Sendhil Mullainathan, distributed by the National Bureau of Economic Research (NBER), shows that analyzing item-level test data rather than overall scores alone reveals valuable insight into student outcomes and teacher quality and can deliver significant returns on investment for school districts.
Using 1.31 billion individual item responses from the State of Texas Assessments of Academic Readiness between 2012 and 2019, the researchers analyze the performance of approximately 5 million Texas students in grades 3-8. They test whether item-level data reveal insights that aggregated performance data do not, and whether that information is useful to school leaders.
The researchers find that aggregated test scores collapse performance into a single ranking, implying that some students and teachers simply perform better across all content areas. Item-level data, by contrast, reveal meaningful variation in strengths and weaknesses. For example, a math teacher may rank second in her school based on her students' aggregated scores, while item-level analysis could show that she ranks first in geometry and third in trigonometry. Likewise, two students who each answer 50 percent of questions correctly may have mastered very different skills. The sketch below makes the contrast concrete.
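The following minimal Python sketch uses made-up responses and illustrative column names, not the study's data or method, to show how a single aggregate ranking can mask topic-level differences between teachers:

```python
import pandas as pd

# Made-up item-level responses: one row per (teacher, topic, item attempt).
# Teachers A and B look identical in aggregate but differ by topic.
responses = pd.DataFrame({
    "teacher": ["A"] * 4 + ["B"] * 4,
    "topic":   ["geometry", "geometry", "algebra", "algebra"] * 2,
    "correct": [1, 1, 0, 1,   1, 0, 1, 1],
})

# Aggregate view: one overall percent correct per teacher, then a single ranking.
overall_rank = responses.groupby("teacher")["correct"].mean().rank(ascending=False)

# Item-level view: percent correct broken out by topic, ranked within each topic.
topic_rank = (
    responses.groupby(["topic", "teacher"])["correct"].mean()
    .groupby(level="topic")
    .rank(ascending=False)
)

print(overall_rank)  # both teachers tie at 75 percent correct overall
print(topic_rank)    # A leads in geometry, B leads in algebra
```

In this toy example the two teachers are indistinguishable on the aggregate measure, yet each is clearly stronger in a different content area, which is exactly the kind of variation an overall score hides.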
The study also finds that item-level analysis substantially improves predictions of longer-term student outcomes, because individual test questions differ in how strongly they relate to particular outcomes. By linking students' test responses to administrative data on demographics, socioeconomic status, academic performance, graduation, college attendance, and earnings, the researchers find that item-level analysis identifies which students are at risk or on track for success more accurately than the aggregated performance method across individual outcomes. Using item-level data improves prediction accuracy for earnings by 44.5 percent, high school graduation by 19 percent, disciplinary violations by 12 percent, college attendance by 6.5 percent, and class failure by 4.7 percent.
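The intuition behind that gain can be illustrated with a small sketch on synthetic data, using a scikit-learn logistic regression; the researchers' actual prediction models, features, and outcome definitions may differ:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: 1,000 students, 40 items (1 = correct, 0 = incorrect), and a
# binary longer-term outcome that depends mostly on the first 10 items.
items = rng.integers(0, 2, size=(1000, 40))
outcome = (items[:, :10].mean(axis=1) + 0.1 * rng.standard_normal(1000) > 0.5).astype(int)

# Aggregate model: predict the outcome from the overall score alone.
total_score = items.mean(axis=1, keepdims=True)
agg_acc = cross_val_score(LogisticRegression(), total_score, outcome, cv=5).mean()

# Item-level model: let each question carry its own predictive weight.
item_acc = cross_val_score(LogisticRegression(max_iter=1000), items, outcome, cv=5).mean()

print(f"aggregate-score accuracy: {agg_acc:.3f}")
print(f"item-level accuracy:      {item_acc:.3f}")
```

Because some questions are far more informative about the outcome than others, a model that sees each item separately outperforms one that only sees the total score.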
The researchers also conduct a policy experiment testing how item-level data could affect decision-making tied to student outcomes. For the cohort of students starting high school in 2016-17, they simulate replacing the bottom 5 percent of teachers, ranked by value-added measures computed either from their students' aggregated performance or from performance on the item-level test questions most predictive of high school graduation. Comparing the simulated outcomes to the cohort's actual 2020-21 graduation rate, they find that the item-level approach would result in 100 additional high school graduates per year.
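A highly stylized version of that exercise, with entirely invented value-added numbers and a simplified replacement rule rather than the paper's estimates, might look like this:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented teacher effects on graduation probability (illustrative only).
# The item-level measure is a sharper proxy for the true effect than the
# aggregate-score measure.
n_teachers, students_per_teacher = 2000, 100
va_true = rng.normal(0.0, 0.02, n_teachers)                 # "true" graduation effect
va_aggregate = va_true + rng.normal(0.0, 0.02, n_teachers)  # noisier proxy
va_item_level = va_true + rng.normal(0.0, 0.01, n_teachers) # sharper proxy

def simulated_graduates(ranking_measure):
    """Replace the bottom 5% of teachers by the given measure with average ones."""
    cutoff = np.quantile(ranking_measure, 0.05)
    kept_effect = np.where(ranking_measure <= cutoff, 0.0, va_true)
    return ((0.85 + kept_effect) * students_per_teacher).sum()  # 0.85 baseline rate

extra = simulated_graduates(va_item_level) - simulated_graduates(va_aggregate)
print(f"additional graduates from item-level targeting: {extra:.0f}")
```

The point of the sketch is simply that a measure more tightly linked to graduation identifies the truly lowest-performing teachers more reliably, so the same replacement policy yields more graduates.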
Because item-level data are already collected through existing testing systems, switching from an aggregated performance model would entail relatively low costs, the researchers argue. Beyond staff time, implementation would primarily require upfront investments such as developing new analysis code, building a framework to link test questions to content categories throughout the year, and acquiring computing resources.
Given this practicality, and given that test preparation already accounts for a significant share of instructional time, the researchers encourage school leaders to make better use of the data they already collect by leveraging item-level information to inform decisions that directly affect student outcomes.
