And why does this matter in drug development?
Statistical analysis methods are often taken for granted. In a recent crowdsourced data analysis project, Nosek and co-workers recruited 29 research teams willing to analyze the same dataset [1]. The results varied from significantly positive to neutral. How is this possible, and what are the consequences?
The dataset
From a sports statistics company, the researchers obtained demographic information on all soccer players (N = 2,053) who played in the first divisions of England, Germany, France, and Spain in the 2012–2013 season. In addition, they obtained data on the players’ interactions with referees (N = 3,147) across their professional careers.
The research question
The research question was: “Are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?” On the basis of photos of the soccer players, two independent raters, blind to the research question, categorized the players on a 5-point scale ranging from 1 (very light skin) through 3 (neither dark nor light skin) to 5 (very dark skin). Available variables on each player included height, weight, number of games played, number of yellow and red cards received, league country, player position, and so on.
Crowdsourcing research teams willing to participate
A description of the project was put online and the project was advertised via the first author’s Twitter account, blogs of prominent academics, and word of mouth. Twenty-nine teams submitted a final report. The teams originated from 13 different countries and came from a variety of disciplinary backgrounds, including psychology, statistics, research methods, economics, sociology, linguistics, and management. Sixty-two percent of the 61 data analysts held a Ph.D. and 28% had a master’s degree. The analysts came from various ranks and included 8 full professors (13%), 9 associate professors (15%), 13 assistant professors (21%), 8 postdocs (13%), and 17 doctoral students (28%). In addition, 44% had taught at least one undergraduate statistics course, 36% had taught at least one graduate statistics course, and 39% had published at least one methodological or statistical article.
Methodology
Each team then decided on its own analytic approach to test the primary research question and analyzed the data independently of the other teams. Afterwards, the teams were encouraged to discuss and debate their respective approaches to the dataset. Some methods turned out to be susceptible to outliers, whereas others were not.
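To illustrate the outlier point, here is a minimal sketch in Python, not taken from the paper and built on simulated data with invented numbers: the same dataset is analyzed once with ordinary least squares and once with a robust Huber regression, and a single extreme observation is enough to pull the first estimate away from the truth while barely moving the second.

```python
# A minimal sketch (not from the paper) of why some methods are sensitive to
# outliers: simulated data, analyzed once with ordinary least squares and once
# with a robust Huber regression. All numbers here are invented.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)   # true slope is 0.5

# Add a single extreme observation.
x = np.append(x, 4.0)
y = np.append(y, -10.0)

X = sm.add_constant(x)
ols_slope = sm.OLS(y, X).fit().params[1]
robust_slope = sm.RLM(y, X).fit().params[1]     # Huber's T norm by default

print(f"OLS slope:    {ols_slope:.2f}")    # pulled away from 0.5 by the outlier
print(f"Robust slope: {robust_slope:.2f}")  # stays close to 0.5
```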
Approaches
The analytic techniques chosen ranged from simple linear regression to complex multilevel regression and Bayesian approaches. The 29 teams used 21 unique combinations of covariates. Apart from the variable games (i.e., the number of games played under a given referee), which was used by all teams, only one covariate (player position, 69%) appeared in more than half of the teams’ analyses, and three covariates were used in just one analysis each. Three teams chose to use no covariates.
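To make concrete how covariate choices alone can move the estimate, here is a hedged sketch in Python on simulated player–referee data; the variable names, effect sizes, and the artificial correlation between skin tone and position are all invented for illustration, and this is not any team’s actual analysis. The same logistic regression is fitted with and without covariates, and the estimated odds ratio for skin tone shifts accordingly.

```python
# A toy illustration (not any team's actual analysis): simulated player-referee
# dyads in which the covariate "defender" is artificially correlated with skin
# tone, purely to show how adjusting (or not adjusting) for covariates changes
# the estimated odds ratio. Every number below is invented.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 5000  # hypothetical player-referee dyads

skin_tone = rng.integers(1, 6, n)                    # 1 = very light ... 5 = very dark
defender = rng.binomial(1, 0.25 + 0.05 * skin_tone)  # artificial correlation, illustration only
games = rng.poisson(8, n) + 1                        # games played under the referee

# Simulate whether a red card was given in the dyad.
lin_pred = -4.0 + 0.10 * skin_tone + 0.60 * defender + 0.05 * games
red_card = rng.binomial(1, 1 / (1 + np.exp(-lin_pred)))

df = pd.DataFrame({"red_card": red_card, "skin_tone": skin_tone,
                   "defender": defender, "games": games})

m_crude = smf.logit("red_card ~ skin_tone", data=df).fit(disp=False)
m_adjusted = smf.logit("red_card ~ skin_tone + defender + games", data=df).fit(disp=False)

print("OR without covariates:", round(float(np.exp(m_crude.params["skin_tone"])), 2))
print("OR with covariates:   ", round(float(np.exp(m_adjusted.params["skin_tone"])), 2))
```

In this toy setup the unadjusted odds ratio absorbs part of the effect of position; each of the 21 covariate combinations used by the teams represents a different version of this trade-off.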
Results
The estimated effect sizes, expressed as odds ratios, ranged from 0.89 (slightly negative) to 2.93 (moderately positive); the median estimate was 1.31. Twenty teams (69%) found a significant positive relationship (p < .05) and nine teams (31%) found a nonsignificant relationship, seven of them nonsignificantly positive and two nonsignificantly negative. No team reported a significant negative relationship.
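For readers less used to odds ratios, a minimal sketch with hypothetical team estimates (not the actual ones) shows how direction and significance are read off: an odds ratio above 1 points to a positive relationship, below 1 to a negative one, and the result is significant at the 5% level when the 95% confidence interval excludes 1.

```python
# Hypothetical (odds ratio, 95% CI) estimates, used only to illustrate how
# direction and significance are read off; these are not the teams' results.
estimates = [
    ("Team A", 1.39, (1.10, 1.75)),
    ("Team B", 1.28, (0.96, 1.71)),
    ("Team C", 0.89, (0.70, 1.13)),
]

for team, odds_ratio, (lo, hi) in estimates:
    direction = "positive" if odds_ratio > 1 else "negative"
    significant = lo > 1 or hi < 1          # 95% CI excludes 1 <=> p < .05 (two-sided)
    print(f"{team}: OR = {odds_ratio:.2f} -> {direction}, "
          f"{'significant' if significant else 'nonsignificant'}")
```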
Interpretation
The authors argue that the variability in results cannot be readily accounted for by differences in expertise: analysts with higher and lower levels of quantitative expertise exhibited similarly high variability in their estimated effect sizes. Further, analytic approaches that received highly favorable evaluations from peers showed the same variability in final effect sizes as analytic approaches that were rated less favorably. The findings thus illustrate the extent to which good-faith, yet subjective, analytic choices can have an impact on research results. This is distinct from the problem of p-hacking, where researchers deliberately choose a statistical method that produces a p-value below the sacrosanct 0.05 cut-off. The authors also argue that their findings are distinct from the so-called ‘forking paths’ problem, where researchers first look at patterns in the data before testing for statistical significance, a problem that is prevented by defining a statistical analysis plan before database lock.
So tell me, are soccer referees more likely to give red cards to dark-skin-toned players than to light-skin-toned players?
The authors write: “The findings collectively suggest a positive correlation, but this can be glimpsed only through the fog of varying subjective analytic decisions.”
What does this mean for drug development?
Although I am not aware of any formal investigation of this question, it may well be that the problem described above is much smaller in drug development, as long as methodological standards, such as blinding, randomization, a predefined statistical analysis plan, and high-quality data management, are adhered to, which is usually the case at well-known pharmaceutical companies and research organizations. Also, the requirements set for statistical analysis by regulatory agencies create a level playing field, to stay in soccer terms. The changes these agencies make to their requirements from time to time offer an opportunity to investigate the influence of different statistical techniques on the outcome. In a comparison of seven different methods for handling missing data [2], ranging from last observation carried forward to mixed-model imputation techniques, the difference between the highest and lowest estimated treatment difference, expressed as a percentage of the highest estimate, amounted to 24%, compared to 62% in the Nosek paper. All these imputation techniques resulted in a p-value < 0.001, whereas the p-values in the Nosek paper ranged from clearly significant to nonsignificant.
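To make that comparison explicit, here is a minimal sketch of the spread metric as I read it, (highest - lowest) relative to the highest estimate, applied to invented treatment-difference estimates for seven missing-data methods; the method labels are standard techniques, but none of the numbers come from reference [2].

```python
# Invented treatment-difference estimates for seven missing-data methods,
# used only to show how the relative spread is computed; these are not the
# values reported in reference [2].
estimates = {
    "last observation carried forward": 3.1,
    "baseline observation carried forward": 3.3,
    "complete cases": 3.6,
    "single regression imputation": 3.5,
    "multiple imputation": 3.9,
    "mixed model for repeated measures": 4.0,
    "mixed model with multiple imputation": 4.1,
}

highest = max(estimates.values())
lowest = min(estimates.values())
spread = (highest - lowest) / highest
print(f"Relative spread across methods: {spread:.0%}")  # about 24% with these toy numbers
```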