A Morally Driven Case for Increasing Methodological Rigor in Aggression Science
In the personal statement I submitted with my graduate school applications in 2013, I opened with a disclosure about the murder of my middle school baseball coach and how it compelled my interest in researching aggression. I remember wondering if I was building a case for a truism: aggression (e.g., bullying, domestic violence, mass shootings) is bad (1); therefore, reducing aggression would be good. The case for the importance of my topic of interest seemed obvious. Aggressive behavior has profound human costs (i.e., distress and impairment) for both victims and aggressors, and it is economically burdensome and a looming existential threat. As outlined in ISRA’s mission statement, “effectively addressing aggression…requires a committed and sustained focus by international scientists.” I entered graduate school eager to join these ranks.
More than 10 years later, I still believe the world would be a better place with less aggression, and thus, using my skills as a psychologist to support this aim is worthwhile. If you share these ethics and goals, we are meeting on common ground. From this place of shared values, I argue that increasing the methodological rigor and replicability of aggression science is a moral imperative.
Much ink has been spilled on the replication crisis in psychology. In short: a troublingly high percentage of published empirical articles do not replicate, yielding meaningfully different conclusions from the original studies and raising concerns about the trustworthiness of the entire literature. The used car market is a great analog for highlighting the dangers this poses for our science. In that market, there is an asymmetry of information about product quality: the salesman knows a car is a lemon, but you don’t. When a meaningful percentage of the products turn out to be shoddy (e.g., your used car breaks down), trust in all such products degrades, and the market may disintegrate because no one wants to waste their money on a lemon. Like used car salesmen, researchers know what is “under the hood” of their products (2) (e.g., how the data were cleaned, which inferential tests were conducted), but editors, reviewers, and readers don’t. So when a meaningful percentage of studies are found to be shoddy (e.g., severely underpowered, non-replicable), trust in the broader literature is compromised, because no one wants to waste time, money, or other resources basing a policy, intervention, or follow-up study on a lemon.
How did we end up in such dire straits? Key factors posited to contribute to non-replicability include:
1) Low statistical power. Researchers run analyses with little chance of detecting the effects of interest, with the biggest culprit being small sample sizes (see Cohen’s prescient warning).
2) P-hacking. Researchers use analytic flexibility or undisclosed data practices (e.g., including/excluding outliers or covariates) in pursuit of statistically significant results. Pressure to report statistical significance is strong due to …
3) Publication bias. Publications are the coin of the academic realm, and peer review is biased toward significant results and against null results. Thus, researchers may try to “play the game” to find statistically significant results in their data and/or relegate non-significant findings to the file drawer. Over time, this inflates the Type I error rate in the published literature, which is difficult to correct and constrains the falsifiability of our theories.
4) Hypothesizing after results are known (i.e., HARKing). Researchers present post hoc hypotheses as if they were a priori, due to bias against the null and preference for confirmatory (vs. exploratory) research in line with the hypothetico-deductive method.
P-hacking and HARKing are tough to prove (though available evidence suggests their rates are well above 0%), and publication bias assessment tools are helpful but more suggestive than diagnostic. Statistical power, on the other hand, is readily calculable, and underpowered studies are unfortunately the norm in psychological science. These concerns pervade the subfields of psychology, and the aggression literature is no exception. Some of my own work suggests the lab aggression literature is underpowered (average power to detect a main effect = 58%), especially for tests of interaction effects (average power = 12%). So, even when an interaction effect is true in the population, the average lab aggression study is powered to detect an effect of typical magnitude (or greater) only about one time out of ten!
If we want other scholars, policymakers, interventionists, and the public to trust our research, the onus is on us to increase the credibility of our work. The good news is that change is afoot! Statistical power in lab aggression studies has increased over the past decade, methodological features that can improve replicability rates have been identified (e.g., strong manipulation checks), and leading journals in our area have implemented open science badges to reward authors for transparent scientific practices (though overall adoption rates remain low). I am hopeful that aggression-specific outlets will adopt Registered Reports, such that well-designed, well-powered, informative studies are accepted regardless of their results, combating bias against the null. There are also abundant resources to help aggression researchers conduct power analyses for study planning, as well as templates for pre-registering one’s plan for analyzing original and secondary data.
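To make that concrete, here is a minimal sketch of what such a power analysis might look like in Python using statsmodels. The effect size (d = 0.20) and cell size (n = 50) are purely illustrative assumptions for a two-condition lab design; they are not drawn from the studies discussed above.

```python
# A minimal sketch of a power analysis for a two-condition lab design.
# The effect size (d = 0.20) and cell size (n = 50) below are illustrative
# assumptions, not estimates from the aggression literature cited here.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# 1) Achieved power: how likely is a study with 50 participants per
#    condition to detect a small between-group effect (d = 0.20)?
achieved_power = analysis.solve_power(
    effect_size=0.20, nobs1=50, alpha=0.05, ratio=1.0, alternative="two-sided"
)
print(f"Achieved power with n = 50 per cell: {achieved_power:.2f}")

# 2) Study planning: how many participants per condition are needed to
#    reach 80% power for that same effect?
required_n = analysis.solve_power(
    effect_size=0.20, power=0.80, alpha=0.05, ratio=1.0, alternative="two-sided"
)
print(f"Required n per cell for 80% power: {required_n:.0f}")
```

Swapping in the effect size one actually expects, ideally informed by unbiased estimates rather than published point estimates, turns the same few lines into a defensible sample-size justification for a pre-registration.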
In other words, open up the hood! Be transparent with your data, analytic code, and specific predictions so readers can judge for themselves how rigorously you’ve tested your claims. For example, although the variety of lab aggression operationalizations can be viewed as a strength, the operationalization strategy a researcher chooses can meaningfully change conclusions, leaving the door open for suspicions of p-hacking or HARKing. Pre-registration and the provision of analytic code and data can anticipate and mitigate such critiques. If we are asking for trust, we must allow verification.
Aggression is a topic worth studying, and therefore worth studying as rigorously as possible. When the rubber of our science meets the road of the real world, it must be robust if we want to reach our destination.
Courtland S. Hyatt, Ph.D., is a clinical psychologist and assistant professor in the Department of Psychiatry and Behavioral Sciences at Emory University.
1. There are, of course, instances where intentionally harming another person who doesn’t want to be harmed (e.g., protecting a loved one; combat) can be “good,” i.e., morally justified.
2. I am not suggesting intentional dishonesty is always or even often at play. Sure, deceitful researchers exist, but non-replicability can also result from a series of justifiable analytic decisions that aggregate to shift conclusions.