Retraction Warranted: Unreliable Data and Findings in IAAF-sponsored Research in BJSM
Roger Pielke, Jr., University of Colorado Boulder
Ross Tucker, University of Cape Town
Erik Boye, Oslo University Hospital
Below is the full text of a letter that we (the three scientists named above) have just submitted to the British Journal of Sports Medicine, in which we explain why a research study looking at testosterone effects on performance in women should be retracted.
Below that, I share some of my own personal thoughts on this issue, and explain why, even though I agree with the IAAF on the principle of regulation and do not believe that the current situation is fair, the IAAF and BJSM are absolutely compelled to retract the paper. We cannot do “the right thing” (a point on which some may disagree, I accept that) in the wrong way.
Below is the story of “the wrong way”.
On 30 April 2018 we requested the performance data reported in Bermon and Garnier (2017, BG17) from Dr. Bermon and the editor of BJSM. We requested aggregate performance data and not any linked medical data that would raise privacy concerns. We made this request after our inability to reproduce sample numbers, means and standard deviations as presented in their Table 3, based on performance data publicly available from the events that they analyzed.
We consider that independent replication of their results is important because the paper forms an important basis for a recently announced hyperandrogenism policy by the International Association of Athletics Federations (IAAF). The new regulations would compel female participants in certain events to undergo medical treatments to lower their testosterone levels in order to be eligible to compete. Thus, the research reported in BG17 is impactful and policy relevant. As BG17 is both funded and conducted by IAAF in support of its own regulations, it is perfectly reasonable to expect that independent scholars will wish to replicate their work.
On 6 July 2018 we and BJSM received from Dr. Bermon a subset of the original data of BG17, specifically for the 11 women’s running events reported in their Table 3. Unknown to us at this time, and not mentioned by either Dr. Bermon or BJSM, on 7 July 2018 BJSM published Bermon et al. (2018, BHKE18), which included the acknowledgment of methodological changes that had resulted in changes to sample sizes and calculated performance differences compared to the original 2017 study. On 9 July 2018 we submitted an earlier draft of this paper calling for BG17 to be retracted, for reasons discussed below. On 27 July 2018 the editor of BJSM notified us that BJSM would not retract BG17 and would not request further data disclosure from the authors.
In this Discussion we document the unreliable data and findings of BG17, both of which are confirmed by reported results of BHKE18. We also present the retraction policy of the publisher of BJSM and guidelines of the Committee on Publication Ethics (COPE) which are followed by most scientific publishers. We conclude that in rejecting retraction of BG17, BJSM does not follow its own policies or international standards of science publication. In this straightforward case, BJSM has compromised its scientific integrity and contributed to what appears to be a highly dubious evidence base for an important policy issue in athletics. Furthermore, the lack of action to uphold its stated policies from BJSM gives an impression that it is protecting the IAAF from the normal expectations of scientific research.
Upon receiving 25% of the original data from BG17 we undertook two tasks:
(a) replication of the overall summary statistics found in Table 3 of BG17, and,
(b) re-creation of the underlying dataset based on reported times from the 2011 (Daegu) and 2013 (Moscow) World Championships (via Wikipedia)
With respect to (a, replication) Table 1 shows that we were able to successfully reproduce the summary statistics with only small differences (emphasized).
Table 1. Replication of summary statistics for women’s track events from Bermon and Garnier (2017). Small differences in replication are emphasized in bold italics.
With respect to (b, re-creation) we found significant anomalies and errors in the underlying data for the four events for which we recreated the dataset by cross-checking times provided by Dr. Bermon with reported results from the 2011 and 2013 World Championships. We selected these four events (women’s 400m, 400mH, 800m and 1500m) to recreate because they are central to the new regulations promulgated by the IAAF. According to IAAF, these regulations are based on the results and conclusions of BG17.
We have identified three types of anomalies/errors, in addition to the inclusion of times (for several events) for athletes who have been disqualified by IAAF for doping. These are:
- Duplicated athletes: more than one time is included for an individual. In each of these instances, more than one time from the 2011 and 2013 World Championships is included for the same athlete.
- Duplicated times: the same time is repeated once or more for an individual athlete, which is clearly a data error.
- Phantom times: no athlete could be found with the reported time for the event.
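A minimal sketch of how the three checks above could be automated. This is not the procedure actually used for the re-creation; the input formats, the function name `flag_anomalies` and all values are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def flag_anomalies(provided_times, official_results):
    """Cross-check one event's provided times against official results.

    provided_times: list of times in seconds (hypothetical input format).
    official_results: list of (athlete, time) pairs covering both
    championships (hypothetical input format).
    """
    # Index the official results by time, rounded to hundredths of a second.
    athletes_by_time = defaultdict(set)
    for athlete, t in official_results:
        athletes_by_time[round(t, 2)].add(athlete)

    phantom_times = []     # no athlete ran this time
    duplicated_times = []  # time appears more often than athletes ran it
    times_by_athlete = defaultdict(set)

    counts = Counter(round(t, 2) for t in provided_times)
    for t, count in counts.items():
        owners = athletes_by_time.get(t, set())
        if not owners:
            phantom_times.extend([t] * count)
            continue
        if count > len(owners):
            duplicated_times.append(t)
        # Attribute a time only when exactly one athlete ran it; ties are
        # left unattributed to avoid false positives (a simplification).
        if len(owners) == 1:
            times_by_athlete[next(iter(owners))].add(t)

    # An athlete with more than one distinct time is a duplicated athlete.
    duplicated_athletes = {
        a: sorted(ts) for a, ts in times_by_athlete.items() if len(ts) > 1
    }
    return {
        "duplicated_athletes": duplicated_athletes,
        "duplicated_times": sorted(duplicated_times),
        "phantom_times": sorted(phantom_times),
    }


# Invented example data, not taken from BG17 or any championship.
official = [("A", 50.10), ("A", 49.95), ("B", 50.40), ("C", 51.00)]
provided = [50.10, 49.95, 50.40, 50.40, 52.22]
report = flag_anomalies(provided, official)
print(report)
```

The sketch treats an athlete who ran an identical time in two different rounds as a duplicated time, so any real use would need to review flagged cases by hand.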
Table 2 provides a summary of the problematic data points for the four events.
Table 2. Re-creation of data of BG17 for four events, summarizing total problematic data points identified.
Problematic data make up between 17% and 33% of the values used in the BG17 analysis for these four events. Given the pervasiveness of these errors, we consider it likely that similar problems might be found in the data for the other 17 women’s events and 22 men’s events, and perhaps as well in the anonymous medical data, which are the basis for the study’s main conclusions regarding the performance effects of elevated testosterone levels. Such pervasive errors in the four events for which we carefully recreated data call into question the fidelity of the entire analysis.
When sharing the partial data, Dr. Bermon notified us that the dataset contained “some errors.” This was further confirmed with the publication of BHKE18, which stated: “We have excluded 230 observations, corrected some data capture errors and performed the modified analysis on a population of 1102 female athletes.” A comparison of reported observations in BG17 and BHKE18 indicates that only 220 observations were dropped from one study to the next, thus BHKE18 erred in its reporting of errors. Figure 1 shows that the dropped data points can be found in every event.
Figure 1. Dropped data points from BG17 to BHKE18, based on observations reported by event in each paper.
Table 3 shows the number of observations in BG17, the number of observations after data points were dropped in BHKE18, and those in our recreation of the BG17 dataset after identifying erroneous or anomalous data points. The table indicates that there are remaining uncorrected data errors in BHKE18.
Table 3. Comparison of data points in two published reports with our analysis of data provided by Bermon after correction for errors
The presence of unreliable data in BG17 is unambiguous: we have documented it empirically, the lead author has acknowledged the presence of errors, and a subsequent analysis has sought to re-do the study after dropping 230 observations, acknowledging “errors.” Further, it appears that some amount of unreliable data persists in BHKE18, since the data provided to us do not match those used in the updated BHKE18 paper. We next show that the unreliable data lead to unreliable results.
The problematic data underpinning BG17 are significant and consequential for the results reported for all events, including the four regulated events. Table 4 compares the sample numbers, means and standard deviations we replicated using the full data set provided to us by Dr Bermon, with the corrected data once we had removed all duplicates, dopers and phantom times, as described previously (Table 2). It reveals that all three outcomes change for all female athletes in the four events upon the elimination of the previously described problematic data points. The change in aggregate times when using corrected data is of a similar magnitude to that of the testosterone effects that the authors seek to identify. Such consequential data errors easily confound identification of the effects that the analysis seeks to quantify.
Table 4. Performance changes for all athletes using replicated data (that provided to us of BG17) and corrected data (based on our re-creation).
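The before-and-after comparison summarized in Table 4 can be illustrated with a short sketch. The function names and the times below are hypothetical; the sketch only shows the kind of calculation involved (sample number, mean and sample standard deviation, with and without the flagged values), not the actual figures.

```python
from statistics import mean, stdev


def summarize(times):
    """Sample number, mean and sample standard deviation of a list of times."""
    return {"n": len(times), "mean": mean(times), "sd": stdev(times)}


def compare_before_after(all_times, flagged_times):
    """Summaries for the full data and for the data with flagged values removed."""
    remaining = list(all_times)
    for t in flagged_times:
        # Remove one occurrence per flagged value; raises ValueError if absent.
        remaining.remove(t)
    before = summarize(all_times)
    after = summarize(remaining)
    pct_change_in_mean = 100.0 * (after["mean"] - before["mean"]) / before["mean"]
    return before, after, pct_change_in_mean


# Invented times in seconds; one duplicated value (53.0) is flagged.
times = [50.0, 51.0, 52.0, 53.0, 53.0]
before, after, pct = compare_before_after(times, [53.0])
print(before)
print(after)
print(f"change in mean: {pct:.2f}%")
```

Expressing the shift in the mean as a percentage makes it directly comparable with the percentage performance differences that the tertile analysis seeks to detect.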
Because we do not have access to the associated medical data, we cannot know what impact problematic data may have had on the BG17 conclusions. However, we can compare the results reported in BHKE18 with those of BG17 to assess the impact of dropping 220 observations.
BG17 and BHKE18 both report performances by event for athletes in the top and bottom tertiles of testosterone levels. It is this difference which is argued by IAAF as the basis for regulation of certain events for female athletes, alleging that women with higher testosterone outperform women with lower testosterone.
In comparing the two studies, the reported differences between these tertiles changed dramatically, as shown in Table 5. The size of the changes in results from BG17 to BHKE18 is, in a majority of instances, similar to the magnitude of the effect being investigated. Figure 2 shows the differences in results between the two studies, by event.
Table 5. Differences in results reported in BG17 and BHKE18
Figure 2. Performance differences between High and Low Testosterone tertiles (left panel) and change in difference between BG17 and BHKE18 (right panel)
Important differences in results between the two studies include:
- For 8 of 11 events the difference in performances between the highest and lowest tertiles decreased (including in 3 of 4 of the regulated events);
- In three events, the performance difference changed from positive (High T faster than Low T) to negative (High T slower than Low T);
- In 6 of 11 events reported in BHKE18, the low T tertile is faster than the high T tertile (compared to 3 of 11 events in BG17);
- In the four regulated events the average difference in times was reduced by 0.4% points (from 2.0% to 1.6%), and only 1 of 4 meets the BHKE18 standard for statistical significance (BG17 reported 3 of 4).
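The arithmetic behind the last point can be made explicit. In the sketch below, the averages of 2.0% and 1.6% are the figures quoted above; the helper `pct_difference` and its sign convention (positive means High T faster) are assumptions for illustration, since the exact formula used in BG17 and BHKE18 is not reproduced here.

```python
def pct_difference(low_t_mean, high_t_mean):
    """Percent by which the high-T tertile is faster than the low-T tertile.

    Assumed convention for timed events: a positive value means the
    high-T tertile recorded the lower (faster) mean time.
    """
    return 100.0 * (low_t_mean - high_t_mean) / low_t_mean


# Average tertile difference across the four regulated events, as quoted above.
bg17_avg = 2.0    # percent
bhke18_avg = 1.6  # percent

point_change = bhke18_avg - bg17_avg               # in percentage points
relative_change = 100.0 * point_change / bg17_avg  # relative to the BG17 figure

print(f"change: {point_change:.1f} percentage points")
print(f"relative change: {relative_change:.0f}%")

# Invented mean times, only to show the sign convention.
print(pct_difference(100.0, 98.0))   # high-T tertile faster
print(pct_difference(100.0, 101.0))  # high-T tertile slower
```

A reduction of 0.4 percentage points is therefore a fall of one fifth in the reported average effect for the regulated events.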
Clearly and unambiguously, the results reported in BG17 change quantitatively in BHKE18 upon removal of 220 data points and introduction of new methods. Without a full replication, it is impossible to know, but it seems plausible that the authors were unable to come even this close to the earlier results by using the same methods reported in BG17. The results of BG17 are clearly unreliable, and those of BHKE18 are of unknown validity.
Without access to the medical data and all linked performances used in BG17, it is impossible to know how or why certain athletes were removed and others not. What is unequivocal is that BG17 used unreliable data and thus its results are also unreliable. BHKE18 used different data and methods and reported significantly different results; because it almost certainly still relies on unreliable data, its results are consequently unreliable as well.
A strength of science is that it is self-correcting. Errors are inevitable in research and when they are identified, they are corrected. The Committee on Publication Ethics explains that in some cases, the retraction of a scientific paper may be warranted: “Retraction is a mechanism for correcting the literature and alerting readers to publications that contain such seriously flawed or erroneous data that their findings and conclusions cannot be relied upon. Unreliable data may result from honest error or from research misconduct.” COPE further explains: “Publications should be retracted as soon as possible after the journal editor is convinced that the publication is seriously flawed and misleading (or is redundant or plagiarised). Prompt retraction should minimize the number of researchers who cite the erroneous work, act on its findings or draw incorrect conclusions.”
BMJ, the publisher of BJSM, has a retraction policy that, like those of most scientific publishers, follows the guidelines of COPE: “Retractions are considered by journal editors in cases of evidence of unreliable data or findings, plagiarism, duplicate publication, and unethical research.”
In various stages, we have identified and documented unreliable data and findings in BG17. This is based on our independent analyses of ~25% of the original data of BG17, as provided to us by Dr. Bermon, and is confirmed by the subsequent effort to re-do BG17 in BHKE18. The revised approach of BHKE18 explicitly acknowledges the removal of 230 data points (itself a likely error, based on comparisons with the data we were provided).
Thus, there can be no doubt that BG17 contains “such seriously flawed or erroneous data that their findings and conclusions cannot be relied upon.”
Based on the evidence, BG17 does not present an editor or publisher with a complex or difficult situation. Yet BJSM has declined to withdraw BG17 in light of the evidence, a decision that surprised and disappointed us, and has also refused to require the authors of BG17 or BHKE18 to release their data to allow for independent replication. The editorial process used by BJSM to arrive at these judgements is also unknown. We find these issues to be highly problematic for a scientific journal.
We maintain our call for BG17 to be retracted and suggest that BHKE18 also merits consideration for retraction. BHKE18 is not a peer-reviewed study, and despite the clear differences in results compared to BG17, mischaracterizes its conclusions as providing “consistent and robust results and has strengthened the evidence.” The IAAF stated to the New York Times of BHKE18 that “the conclusions remain the same” as BG17. This demonstrable falsehood has been enabled by BJSM, and is sure to propagate further into the scientific literature and policy settings.
Consequently, the unwillingness of BJSM to uphold basic standards of scientific integrity has unnecessarily muddied the evidence base on which sports governance decisions are being made. At worst, there is an appearance that BJSM is shielding IAAF and its researchers from meeting normal expectations in scientific research.
More generally, this case illustrates clearly the importance of data sharing in science as well as the role of independent checks on data with policy or regulatory significance. This is especially the case when an interested party (in this case IAAF) is sponsoring research to support a policy that it advocates. Conflicts of interest are best dealt with via transparency and commitments to integrity. We encourage BJSM to adopt a more rigorous policy on procedures for retraction and ensuring data availability consistent with best practices among scientific publishers. Mistakes happen. Science is robust because mistakes can be corrected. When mistakes can’t be corrected we are no longer dealing with science, but something else.
High testosterone, unfair advantages and principles of transparency – preface
So, what you’ve just read is the history of what began as a relatively innocuous request for data, born out of curiosity, because some of the performance times reported in the Bermon paper didn’t quite add up, and we couldn’t figure out how they got them using the methods described.
That innocuous process has culminated in a call for retraction. I realize that calling for a paper to be retracted is the “nuclear” option. It’s a big deal. It is not one I took lightly at all. However, it is warranted in this case. In fact, in my opinion, this is a very simple case. A very clear example of a paper that should be retracted. Below is the full history of that story, which explains the process we followed, what we found, and why we asked for retraction.
I must stress the following – I agree, in principle, with the IAAF’s attempts to regulate the boundary between women’s and men’s competition. I have done since 2009, when Semenya won the 800m world title in Berlin. I also believe that testosterone is a viable means to achieve this, because testosterone is a root cause of many of the biological differences that give men an enormous performance advantage over women. How enormous? Well, in 2017, taking the 800m event as an example, there were 4,000 men who were faster than the fastest woman. There are 12,285 KNOWN performances faster than that women’s world leader in 2017 alone. These are massive differences, and testosterone is crucial to them.
The present situation of women with XY chromosomes, internal testes and high levels of circulating testosterone competing without any regulation is a massively complex one for sport to deal with. It is one that has no solution that keeps all parties happy. It is sport’s unsolvable problem.
Both qualitatively and quantitatively, it is nothing like being tall in basketball, or having long arms or big feet in swimming. First, rightly or wrongly, we do not compete in categories of height in basketball. There is no “small-footed” Olympic swimming champion, because we have not deemed it necessary to “protect” the competitive integrity of people with the disadvantage (small feet, in this example) compared to those with big feet.
For a better analogy, think of Paralympic sport, where people with different degrees of disability are separated to ensure fairness. Or think of boxing, where weight categories exist to protect both competition and safety of smaller fighters. In these sports, we have decided that an advantage (less severe cerebral palsy, or being bigger) changes the meaning of competition and so we create a category for the ‘disadvantaged’ that ensures competition integrity.
If we did have a category for short people in basketball, by the way, (the under-6-foot Championships), then I’m sure all would agree that we’d need to measure the height of people who wanted to play it, and we would not allow exceptions for someone who was 6 foot, half an inch, even though “it’s not their fault, they were just born that way”.
On a quantitative level, the differences made by T are also profound compared to other advantages – as I wrote above, 12,285 performances by men were faster than the women’s fastest 800m runner in 2017 alone. If you think that foot size makes a big difference to swimming performance, and that having small feet is the disadvantage, then can you honestly say that you think that 12,285 performances are recorded by swimmers with big feet before the fastest small-footed swimmer would finish? Of course not – there is no comparison with the advantage conferred by the Y chromosome, the testes, and the testosterone. It is an order of magnitude larger than anything enjoyed by a tall man in basketball or a long-armed swimmer. In fact, being genetically male is the single biggest performance advantage in sport. This is biological reality (in most sports). So no, not all advantages are equal, and so I believe that the Y-chromosome/Testosterone is an advantage that must be regulated in order for women’s sport to have meaning.
So, my personal opinion is that some kind of regulation is important, even essential. In a theoretical world, this regulation would be based solely on the presence of a Y-chromosome and internal testes, but of course, the situation is more complex than this, because of conditions that render individuals unable to use this testosterone. The attempts to accommodate and respond to this complexity have, ironically, created even more complexity, and we are at a point where this important (in my opinion) regulation is borderline impossible.
For these reasons, I felt that the previous IAAF policy, which required T levels to be lower than 10 nmol/L and was applied to all events, was a reasonable compromise. I think that the Court of Arbitration for Sport erred when they set it aside, and asked the IAAF to come back with strong evidence of the advantage (which, and this is the crux, they framed as 10 to 12%, given that this is men’s advantage over women). The IAAF were faced with an impossible task – how do you gather this research, how do you attempt to ask this question in a quality study? It is impossible, short of getting say 30 individuals with DSDs, giving half a placebo and half the medication to lower their T values, controlling EVERY single other factor that may affect performance, and monitoring them for 12 months. As I have said a few times, the good study is unethical, and the ethical study is simply not very good.
And so, at the time of that CAS decision, I felt some sympathy for the IAAF, that they had an uphill task, and that whatever they came back with would likely be insufficient to clear the bar set by CAS, unless their lawyers could work some Froome-like magic.
All of which brings us to the present situation. What the IAAF have done is conduct a research study that our very basic re-analysis shows is deeply flawed, with the potential to massively compromise its conclusions, and hence the policy on which it is built.
And so, despite my agreement with the principle of regulation, and even part agreement with the use of Testosterone to achieve it, I cannot accept that their data is so poor, and their policy built on such a deeply compromised research study.
The principles of transparency, scientific integrity and accountability are, for those who’ve followed me for a while, a recurring theme. Whether in anti-doping, technology or financial management, transparency is crucial for good governance and trust in leadership. Sport cannot afford unnecessary secrecy. The IAAF have failed on this count, and the British Journal of Sports Medicine have not upheld the scientific requirement for the same.
Therefore, I gladly put my name and time into this process of evaluating that Bermon et al study, and the call for its retraction. Principles of transparency and scientific integrity are bigger than the hyperandrogenism policy and my own personal views on whether Semenya should be allowed to compete as she has. What it boils down to is that it’s all good and well to do the “right thing” (and I appreciate that people may disagree on what the right thing is in this case – even my co-authors Roger and Erik do, and that’s fine. We respect one another enough to collaborate despite it!), but you have to do it the right way too.
And if you do the right thing, or are trying to do the right thing (in principle), but you do it the wrong way, well, that simply can’t be allowed to stand. Not by the IAAF, and not by the British Journal of Sports Medicine.
This is the story of that “wrong way”, and why we cannot allow it to stand.
Bermon, S., & Garnier, P. Y. (2017). Serum androgen levels and their relation to performance in track and field: mass spectrometry results from 2127 observations in male and female elite athletes. Br J Sports Med. http://dx.doi.org/10.1136/bjsports-2017-097792
Bermon, S., Hirschberg, A. L., Kowalski, J., & Eklund, E. (2018). Serum androgen levels are positively correlated with athletic performance and competition results in elite female athletes. Br J Sports Med, bjsports-2018.
 Again, only 220 data points were actually dropped.