Florida’s public schools have come under intense pressure in the last 10 years, with a strong focus on accountability. The primary measurement used to enforce this accountability is student performance on standardized tests, based on the assumption that these tests are accurate measures of student learning, teacher effectiveness and overall school quality. Scores on the Florida Standards Assessments (FSA) are used to evaluate teachers, administrators, schools and districts. Other high-stakes uses of the FSA include student retention and graduation decisions. There is, however, little to no agreed-upon evidence that the test is an accurate measure of quality, meaning that high-stakes decisions (including teacher and administrator job losses and school closings) are being made based on questionable information.
The Validity of Standardized Tests
Standardized tests, in general, are not a reliable basis for such high-stakes decisions. Decades of research support this conclusion:
- “Why Standardized Tests Don’t Measure Educational Quality,” by W. James Popham, UCLA Emeritus Professor.
- A study from MIT, Harvard and Brown University indicated that higher standardized test scores do not translate into improved cognitive abilities.
- Research from a UCLA professor shows why standardized tests don’t measure educational quality.
- A Stanford University study shows how stereotypes prevent standardized tests from accurately measuring student performance.
- This article references a whitepaper from the Central Florida School Board Coalition about the misuse of standardized tests in Florida, including examples of how changing cut scores and introducing new calculations such as learning gains significantly affects outcomes.
- Research conducted by a professor at Arizona State University with a PhD in educational psychology, specializing in testing statistics and research, concludes that many unaddressed issues undermine the validity of standardized tests, and that it is therefore dangerous to place significant weight or stakes on them, especially when they are not corroborated by other, more valid measures such as class grades.
- Research from a professor at Northwestern University argues that public policymakers place too much emphasis on standardized tests as a valid measure of student performance, noting that many of the technical reasons the tests are not valid are ignored when decisions are made about their use.
- Research from Washington State University tying parental income to standardized test scores.
- Researchers have used census data to accurately predict test results, indicating a clear connection to factors outside of school control.
- A study conducted by several faculty members at U.S. universities indicating that standardized test scores have gender and ethnicity bias.
- A study from a professor at the University of Illinois concluded that college readiness decreases when schools focus on test scores.
- A nine-year study by the National Research Council concluded that the emphasis on testing does not significantly increase learning, but actually causes harm.
- Research from professors at Bates College indicates that SAT and ACT scores are not valid predictors of future success.
- A political scientist at the University of Massachusetts presents research about how the increase in standardized testing is driving parents away from their schools.
There is also research indicating why standardized test scores are not accurate measurements of teacher quality:
- The American Statistical Association says value-added models (VAM) are not a reliable way to measure teacher quality. While the state-mandated VAM requirement was removed in 2017, many districts still use it because state statute still requires that teachers be evaluated based on student progress.
- Research published by the Economic Policy Institute presents the problems with using student test scores to evaluate teachers.
- Fairtest.org presents strong research, with sources noted, about why standardized tests are not a valid measure of teacher quality.
- This article includes a video from Florida teacher Luke Flynt presenting hard facts about flaws in the state’s value-added model (VAM), which ties teacher performance evaluations to test scores. He cites specific instances in which his students’ predicted scores were higher than a perfect score, so even perfect performance lowered his evaluation rating.
- Professor David Berliner at Arizona State University demonstrates why standardized tests reflect the demographics of the students who are tested and ignore teacher behavior.
- A working paper from a professor and a student at Brown University indicates that rising test scores do not correlate with quality teaching.
The Validity of the FSA
If test scores are being used for high-stakes decisions, it stands to reason that the test itself should be proven valid. In 2015, the Florida Commissioner of Education testified that the FSA had been validated in Utah, where many of the test bank questions were created. When asked to produce reports from that validation, none surfaced. Legislators rightfully ordered an independent verification of the FSA to test its validity and ensure it was being used for appropriate purposes. The study was not truly independent: the company selected to conduct it, Alpine Testing Solutions, partnered with EdCount, which is itself a partner of the original test creator, AIR, and the project team included many AIR employees. Even so, the study brought forth several material discrepancies questioning the validity of the test:
- Although one of the purported purposes of the Florida Standards (which the test is designed to measure) is to increase rigor in schools, the validity study noted that the administration of the FSA does not meet that qualification: “The evaluation team can reasonably state that the spring 2015 administration of the [Florida Standards Assessments] did not meet the normal rigor and standardization expected with a high-stakes assessment program like the FSA.” (Validity report, page 14)
- The study confirmed that the questions on the FSA are aligned to Utah standards, which may vary significantly from those in Florida. According to the executive summary: “the items were originally written to measure the Utah standards rather than the Florida standards. While alignment to Florida standards was confirmed for the majority of items reviewed via the item review study, many were not confirmed, usually because these items focused on slightly different content within the same anchor standards.” (Validity report, page 47)
- The report recommended that results from the computer-based FSA not be used as a sole factor in student-level consequences, such as retention or remediation, and yet it is being used for this purpose in many districts. In fact, there is no mention in the study that the test is valid for its intended purpose, which is to assist teachers and improve learning. (Validity report, page 20)
- The study was intended to review a full range of FSA tests, including 3rd through 10th grade English Language Arts (ELA), 3rd through 8th grade Math and several high school tests including Algebra 1 and 2. The report, however, left out 11 of the 17 exams the study was supposed to evaluate. (Validity report, page 7)
- The report itself indicates significant shortcomings of the study, notably: “There are some notable exceptions to the breadth of our conclusion for this study. Specifically, evidence was not available at the time of this study to be able to evaluate evidence of criterion, construct, and consequential validity. These are areas where more comprehensive studies have yet to be completed. Classification accuracy and consistency were not available as part of this review because achievement standards have not yet been set for the FSA.” (Validity report, page 19)
- The following are excerpts from an interview with Andrew Wiley, chief psychometrician at Alpine, who worked on the validity study: “Within that report, Wiley said, the reviewers did find enough data to support using the test results at an aggregate level, such as school grades. However, he cautioned, many unknowns about the impacts to individual schools and students exist. That means more study would be appropriate, as the report suggests.” … “We need to talk about how the aggregate scores can be used,” he said. “It should be done with a lot of caution.” Wiley also shied away from calling the test “valid”: “Validity is not a simple yes-no,” he said.
Flawed School Grading Formulas
One of the aspects of the accountability system with the most negative repercussions is the school grading system. At the elementary level, grading formulas are based 100% on FSA scores (including learning gains, which are measured by the same test). At the middle and high school levels, other factors are involved, but the majority of the formula is still based on these tests. The formulas do not take into account known factors affecting student test scores that are beyond the control of the teacher or the school (e.g., socioeconomic environment, parental involvement, absenteeism).
The Florida Board of Education has the authority to set cut scores lower or higher in order to manipulate the outcome, further undermining the validity of decisions made based on these tests. In fact, cut scores have been changed several times over the course of the school grading system’s history. Newspaper headlines include: “State may see more ‘F’ schools: Changes in system may net more failures” (2002); “FCAT-grade criteria to get tougher” (2003); “New FCAT issues raised: Some say tests easier” (2004); “FCAT reading scores on the decline” (2005); “Florida schools granted leeway: It may mean more public schools pass” (2005); “School grading system may change” (2008); “FCAT audit to delay school grades” (2010); “FCAT writing scores drop across Florida” (2012). The frequency with which cut scores change calls into question the reliability of decisions based on those scores.
A Questionable System
Given the statistical research documenting known biases in standardized test results, it is no wonder that simple census data can be used to predict school grades. This renders those grades questionable at best, and it is certainly not advisable to use them for high-stakes decisions. This blog provides a good summary of what Florida’s school grades really tell us.
A 2015 Gallup Poll sought public input about teachers, teaching and school reform. Among the findings:
- About two-thirds of Americans say there is “too much emphasis on standardized testing” in public schools, and a strong majority—about 8 in 10—believes the effectiveness of their local public schools should be measured by “how engaged the students are with their classwork.”
- Only 14% report “scores that students receive on standardized tests” as “very important” measures of schools’ effectiveness.
- 55% oppose requirements that “teacher evaluations include how well a teacher’s students perform on standardized tests.”
- When asked directly which approaches would provide the most accurate picture of a student’s academic progress, the public favored “examples of the student’s work” (38%); “written observations by the teacher” (26%); and “grades awarded by the teacher” (21%). By contrast, “scores on standardized achievement tests” were chosen by only 16%.
We owe it to our schools, teachers and students to develop a better way of measuring quality and ensuring accountability.