To address the “replication crisis,” Michael Inzlicht, a U of T professor of social psychology, shown here at his UTSC lab, is among researchers calling for greater openness and transparency about how experiments are conducted. Photo by Lorne Bridgman.
Michael Inzlicht was a post-doctoral fellow in applied psychology at New York University in 2002 when he first heard the term “ego depletion.” A recent academic study had suggested that using your self-control or willpower on one task made it more difficult to apply it to others. The concept helped to explain why people might skip the gym after a stressful week at work or give up on a diet after having to soothe an unhappy child all day.
For Inzlicht, ego depletion was a revelation. “I had never heard of it,” he says. “My adviser mentioned it to me and I thought, ‘This is such an elegant theory. It seems so intuitively real and true.’” He decided he wanted to study it himself.
Five years passed, and Inzlicht, who had become a professor of psychology at U of T Scarborough, was focusing his own research primarily on ego depletion. He had attracted a number of grants because of it and that year had written an influential paper on the topic. “I was definitely excited,” he says. “I thought I was onto something big and important and I wanted the whole world to know.” Soon after, he received tenure at U of T. In a sense, he felt ego depletion had helped secure his career.
Then, in 2011, he read a Psychological Science article called “False-Positive Psychology” that shook his faith in his own research.
“It detailed all the possible steps a researcher could take to make nonsense research findings appear to make sense,” he says. The paper described common practices in the field – drawing conclusions from a small number of observations, making subjective, on-the-fly judgments about how to collect and interpret data, and reporting only what “worked” and not what didn’t – that could lead to false positive results.
Inzlicht began to question the early experiments that had helped forge his career. His worry turned to doubt, which devolved into a conviction that he had made some of the very errors the paper described.
“I don’t remember exactly what I did to get my early research results but I have no doubt that they were shaped by what we now know are questionable research practices,” he says. “One study had 42 participants divided into four groups. In hindsight, this is severely underpowered. Even if the results I’d been seeking were true, I’d be very unlikely to find it from those small numbers.”
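To get a rough sense of what “underpowered” means here, the following sketch simulates a comparison between two of the four groups. It rests on several assumptions that do not come from Inzlicht’s study: 42 participants split into groups of about 10 and 11, a true medium-sized effect of half a standard deviation, normally distributed scores and a standard two-sample t-test. The function names are hypothetical, chosen only for illustration.

```python
import random
import statistics

# Two-sided 5% critical value of Student's t with 19 degrees of freedom
# (10 + 11 - 2). Group sizes are an assumption, not taken from the study.
T_CRIT_DF19 = 2.093


def pooled_t(a, b):
    """Equal-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    var_pooled = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    se = (var_pooled * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def estimate_power(effect_size, n_a=10, n_b=11, trials=20_000, seed=0):
    """Fraction of simulated experiments that reach p < 0.05 when the true
    difference between groups is `effect_size` standard deviations."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(effect_size, 1.0) for _ in range(n_a)]
        b = [rng.gauss(0.0, 1.0) for _ in range(n_b)]
        if abs(pooled_t(a, b)) > T_CRIT_DF19:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("power with a real medium effect:", estimate_power(0.5))
    print("false-positive rate with no effect:", estimate_power(0.0))
```

Under these assumptions, most simulated experiments miss an effect that really exists, which falls well short of the 80 per cent detection rate that researchers often treat as a target.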
Across social psychology, as well as in economics, health sciences and other disciplines, many seemingly well-established concepts and theories were being called into question. Efforts to replicate many studies – to derive the same results from experiments performed using the same parameters as the original study – were not successful.
Scientists are now debating the severity of the so-called “replication crisis” and what to do about it. Inzlicht and several other U of T researchers, along with colleagues from around the world, have begun advocating for improvements to research methods.
Among these champions of change, Inzlicht stands out for his candour and deep emotion. In essays, blog posts and interviews, he has publicly called into question his own career-making research. He has bluntly chronicled the systemic problems that caused the crisis, and has also acknowledged how his own work embodied those flaws.
“What I discovered about myself was not pleasant. It hurt to learn that my early papers were dominated by findings that did not appear robust; it was upsetting to think that my early work stood a decent chance of not replicating,” he wrote in 2016. “To publicly admit my past shortcomings was scary.”
He was worried for his job, for the respect of his colleagues and for his prospects. More than any of that, though, he just felt dispirited by the idea that many years of hard work might have been for nothing.
“If I admit the field is rotten, that means I have contributed to that rot. That’s hard. I’m going through turmoil as a result of that,” he says.
Inzlicht’s dejection came with a sort of grim resolve. He still believes social psychology research can demonstrate meaningful, interesting, real effects. Getting there requires better research methods. It also means going back and testing, questioning – and possibly rejecting – many long-accepted concepts.
“I really admired Michael’s approach,” says Brian Nosek, a psychologist at the University of Virginia and a co-founder of the independent Center for Open Science, based in Charlottesville, Virginia. Nosek spearheaded attempts to replicate results from 97 published psychology studies that had claimed a positive outcome; 35 of the replications were successful. “He very effectively captured a feeling that many researchers have of wanting to resist the evidence about reproducibility but also being concerned by it. He gave voice to that reaction: What does this mean to my work? Should I be doubting my contributions?”
Nosek calls what’s happening now “science at its best.”
“Maintaining that constant self-skepticism, testing to see if our field is as robust as it can be is how science is supposed to work,” he says.
Elizabeth Page-Gould, a U of T psychology professor and a colleague of Inzlicht’s, is among those calling for more open and transparent research practices. “Even simple measures would help,” she says, such as providing “badges” on papers that indicate where to find the source materials and data online, or asking researchers to “preregister” their hypothesis to ensure that it doesn’t change later to suit the results. The experiment’s outcome – successful or not – should be measured against its original intent, she says.
Page-Gould, Inzlicht and others also emphasize the need to report on “failed studies.”
Scientists have a deep aversion to asserting straight-up success or failure. They rarely describe results in terms of truth or falsehood. Instead, they speak about “degrees of certainty.” It’s why virologists say, “There’s no credible evidence that vaccines cause autism,” rather than the more definitive “Vaccines don’t cause autism.”
If an experiment fails to support a hypothesis, a researcher naturally considers whether the study had flaws. They try again, controlling for different variables, adjusting parameters and working to remove factors that might create a “false negative.”
If on, say, the 20th attempt, the data reveal the expected effect – hypothesis confirmed. It is possible, and can feel reasonable, to publish the successful results and dismiss the 19 practice runs.
But, what if all the tweaking and adjusting didn’t actually do anything, and that final outcome merely resulted from random variations in the data?
Researchers have an indicator for that eventuality, which involves something called a “probability value” or “p-value.” The p-value describes how likely it is that random variations in data could show an effect when there really isn’t one.
The p-value is written as a decimal number between zero and one. The closer it is to zero, the less likely it is that random variation alone would produce the observed effect.
Sample size, data precision and other factors can affect p-values. An experiment is commonly considered solid if its p-value is no more than 0.05, meaning that if no real effect existed, random variation would produce a result at least this strong only about one time in 20.
But each new attempt at fine tuning makes a random blip more likely. Not reporting the pre-success tests distorts the significance of that one positive result. Such selective reporting is known as “p-hacking,” and it’s one of the major reasons studies fail to replicate.
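The arithmetic behind that distortion can be sketched in a few lines. The example assumes that each attempt is an independent test of an effect that does not exist, so that each one has a five per cent chance of crossing the 0.05 threshold by luck alone. Real reruns of a tweaked experiment are rarely fully independent, so the numbers are illustrative rather than exact, and the function names are hypothetical.

```python
import random

ALPHA = 0.05  # the conventional significance threshold


def chance_of_false_positive(attempts, alpha=ALPHA):
    """Probability that at least one of `attempts` independent tests of a
    nonexistent effect comes out 'significant' purely by chance."""
    return 1.0 - (1.0 - alpha) ** attempts


def simulate(attempts, trials=50_000, alpha=ALPHA, seed=0):
    """Monte Carlo check of the formula. When there is no real effect, a
    p-value is (by assumption here) uniform between 0 and 1, so each attempt
    falls below alpha with probability alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(rng.random() < alpha for _ in range(attempts)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(chance_of_false_positive(k), 3), round(simulate(k), 3))
```

With one attempt the chance of a fluke is five per cent. Under these assumptions, by the 20th attempt the chance that at least one run has “worked” by accident rises to roughly 64 per cent, which is why the unreported runs matter.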
Academic journals and institutions have also contributed to the replication crisis through their tendency to favour positive results over negative.
Institutions’ policies have been changing, including those of U of T. “We’re developing standards for data management, and moving toward more open and accessible data,” says Vivek Goel, the university’s vice-president, research and innovation, and a professor at the Dalla Lana School of Public Health. He says replicability must be viewed as an indicator of research excellence.
Many journals have also started to address their role in the replication crisis.
Peter Morrow, a U of T professor of economics, is the data editor for the Canadian Journal of Economics. That journal now has a policy that papers must include all data and the processes researchers used to generate results.
Morrow says transparency helps reveal (valid) subjective judgments that often lead to economists reaching different conclusions from identical data.
“Suppose you and I were studying the history of employment levels in Puerto Rico,” he says. “You might say, ‘The last couple of months were weird because of the hurricanes. I should drop them because they are less likely to be indicative of long-term trends.’ But I might say we should keep them because they add information and we don’t know how long-lasting their effects might be. These are subjective edits. Even though we start with the same data, we end up with different results.”
Subjectivity isn’t inherently problematic. In fact, it’s part of good science. Trouble arises when biases and judgment are obscured.
“Certain widely used statistical practices seem justifiable in the moment, but can easily inflate the rate of false discoveries. They used to be considered not optimal, but acceptable. That perception has changed now,” says Ulrich Schimmack, a psychology professor at U of T Mississauga. Schimmack has worked for years to develop new tests and tools to improve replicability. “People are demanding more rigorous statistical analysis. The whole perception of what is proper science has changed in psychology.”
Replicating a study can be expensive, difficult and time-consuming. Statistics can reduce the costs – and the effort. “I have developed a statistical tool that tells us what we could expect if we did replication studies,” he says. In other words, he can test for replicability without actually running the experiment again. He calls his replicability index “a doping test for science.”
His statistical models, which he continually refines and tests against real-life replication efforts, use sample size, experiment design and a range of other factors to predict the likelihood of replicability. He reviews and assesses published studies to measure whether the crisis is improving.
“I started tracking studies from 2010,” he says. “Up to 2015 there’s no change. In 2016 there’s something, but only in 2017 does it seem there’s a real change.”
He hopes that tools like his will push researchers, institutions, funding agencies and journals to make replicability a cornerstone of research excellence.
Michael McCullough, a psychology professor at the University of Miami, says the research community has “made a kind of mess for ourselves,” but he cautions that plenty of psychology research remains solid.
“There are facts about how human psychology works,” he says, describing effects such as optical illusions, after-image effects, verbal overshadowing (where the act of recounting an experience changes your memory of it) and the “cocktail party effect,” where you can hear your name spoken in a crowd even if your attention is elsewhere.
He cautions against wholesale dismissal of real psychological effects.
Inzlicht still believes that ego depletion is real. But he has all but given up on trying to capture it in an experiment. Meanwhile, he has changed his research methods and continually checks his more recent work for replicability. For what it’s worth, he says, these new approaches are making a difference.
“What’s frustrating, and it’s a field-wide problem, is that the work we’re doing now should have been done from the beginning,” he says. “The first bricks were never laid. Instead an entire edifice was built without a strong foundation.”
Patchen Barss is a Toronto-based journalist and author specializing in science, technology, research and culture.