Reader Beware: Interpreting Clinical Trial Data

Numbers don’t lie – except when people cause them to. 

With an abundance of scientific research being presented and published, many health-care practitioners rely on the researchers’ interpretation of their data to glean clinically relevant information. But, as some recent examples have shown, they might need to pay closer attention to how data are being reported.

“It is incumbent on clinicians to be able to read the literature with a critical eye, rather than relying on others to vet the findings,” said Ian F. Tannock, MD, PhD, DSc, emeritus professor of medical oncology at Princess Margaret Cancer Centre in Toronto. “There are [statements] that get into even the highest-impact journals that are questionable.”

What appears in scientific journals – even reputable, high-impact ones – may represent an accumulation of mistakes and miscalculations that occur throughout the research process, from trial design and conduct through analysis and, eventually, interpretation. Statistics can be misleading, and most medical professionals have only a basic understanding of this branch of mathematics and may not be able to tell when the wrong test has been applied to the right data. Visual representations can be useful for digesting large amounts of data, but they, too, can be hard to interpret and easy to manipulate.

ASH Clinical News spoke with Dr. Tannock, statisticians, and other experts in trial design about the strengths and limitations of research designs, the ways in which statistics can point to an incorrect conclusion, and advice for how clinicians can arm themselves against misinterpretation.

Statistics and Humans: A Poor Match

According to Nobel Prize–winning behavioral economist Daniel Kahneman, PhD, humans are not intuitive statisticians. We learn grammar intuitively, along with a wealth of other skills important for survival; when it comes to statistics, though, our brains prefer simple answers and cognitive ease. We tend – without even realizing we’re doing it – to accept more straightforward, less brain-taxing explanations. In effect, we jump to conclusions based on limited information.

Most readers appreciate a simple graph, infographic, or other visual tool that allows them to “get the point” without having to slog through tedious text. But visualization is another method by which researchers can deceive with data – either intentionally or unintentionally. (Can you really trust what a graph is saying? See the SIDEBAR.)

“Much of [how we interpret data] is just confirmation bias,” said Sara R. Machado, PhD, an economist in the Department of Health Policy at the London School of Economics who studies blood donation systems and donor retention. “We’re more likely to believe results that align with our prior hypothesis, so if the statistics show what we are expecting to see, we don’t look with a critical eye.

“Part of the problem is that we tend not to recognize when our understanding is limited,” she added. “For example, I know basic Italian. If you give me a book to read in Italian, my imperfect knowledge of Italian might have misled my interpretation of the meaning.”

Take the statistical phenomenon “regression to the mean”: An unusually large or small measurement typically will be followed by a value that is closer to the mean because of random measurement error. In other words, things tend to even out over time.

Even researchers who know not to imply causality from nonrandomized data can readily fail to recognize and label random fluctuations. This tendency is apparent in studies examining health-care interventions in populations that have high-risk disease characteristics. Because the “high-risk” classification often implies outlier values, individuals initially identified by their outlier values will likely have lower values on remeasurement, with or without intervention.
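
A small simulation can make this concrete. The sketch below is illustrative only: the population mean of 100, the two standard deviations, and the "high-risk" cutoff of 120 are all assumed values, not figures from any study. Each simulated patient has a stable underlying level that is measured twice with random error, and no one receives any intervention.

```python
import random
import statistics

def simulate_regression_to_mean(n_patients=10_000, true_sd=10.0, noise_sd=10.0,
                                threshold=120.0, seed=0):
    """Select 'high-risk' patients on a noisy first measurement, then remeasure.

    All parameters are hypothetical. No treatment effect is simulated.
    """
    rng = random.Random(seed)
    first_high, second_high = [], []
    for _ in range(n_patients):
        true_value = rng.gauss(100.0, true_sd)           # stable underlying level
        first = true_value + rng.gauss(0.0, noise_sd)    # noisy first measurement
        second = true_value + rng.gauss(0.0, noise_sd)   # noisy remeasurement
        if first >= threshold:                           # labeled high-risk at visit 1
            first_high.append(first)
            second_high.append(second)
    return statistics.mean(first_high), statistics.mean(second_high), len(first_high)

mean_first, mean_second, n_selected = simulate_regression_to_mean()
print(f"selected as high-risk: {n_selected}")
print(f"mean at selection: {mean_first:.1f}, mean on remeasurement: {mean_second:.1f}")
```

Under these assumptions, the selected group's average falls on remeasurement even though nothing was done to them, which is why an uncontrolled before-and-after comparison in a high-risk group can look like a treatment effect.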

“If presenters take the time to explain the story their data are telling, rather than using jargon and statistics, it’s much easier to avoid this kind of narrative fallacy,” Dr. Machado said. Translating statistical observations into simple language (e.g., “as generations go on, people who are worse off tend to improve and those who are better off tend to worsen”) lowers the likelihood of misinterpretation, she added.

The Problem With P Values

In scientific literature, a p value of less than 0.05 generally is set as a benchmark to determine whether findings are “statistically significant,” but readers can make the mistake of conflating that number with clinical significance, according to Grzegorz S. Nowakowski, MD, from the Mayo Clinic in Rochester, Minnesota. “P values don’t tell you anything about clinical benefit. They only tell you how likely your results are to be true and not a play of chance,” he explained. Dr. Nowakowski also serves as a co-chair of the American Society of Hematology (ASH) Working Group on Innovations in Clinical Trials.

Allan Hackshaw, PhD, an epidemiologist at Cancer Research UK and the University College London Cancer Trials Centre who teaches clinical trial design and consults with trialists, agreed. “A p value just addresses the question, ‘Could the observed result be a chance or spurious finding in this particular trial, when in reality the intervention is completely ineffective?’” The answer to this question is always “yes,” he said, but people tend not to ask this question when interpreting p values.

In the same way that one could flip a coin 10 times in a row and come up with heads every time, despite there being nothing wrong with the coin, a p value of <0.05 does not necessarily indicate that a treatment is effective. The commonly used cutoff value of 0.05 means that, when no true effect exists, one spurious effect (a false-positive result) is expected in roughly every 20 comparisons.
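
The arithmetic behind both statements is short enough to write out. This is a minimal sketch that assumes a fair coin, independent comparisons, and no true effect in any of them; the function names are made up for illustration.

```python
def prob_all_heads(n_flips: int) -> float:
    """Chance that a fair coin lands heads on every one of n_flips tosses."""
    return 0.5 ** n_flips

def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of p < alpha results when every null hypothesis is true."""
    return n_tests * alpha

def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more p < alpha results across independent null comparisons."""
    return 1.0 - (1.0 - alpha) ** n_tests

print(f"10 heads in a row: {prob_all_heads(10):.5f}")
print(f"expected false positives in 20 tests: {expected_false_positives(20):.1f}")
print(f"chance of at least one in 20 tests: {prob_at_least_one_false_positive(20):.2f}")
```

The last function is the one that matters when a paper reports many endpoints or subgroups: the chance that something crosses the 0.05 line by luck alone grows quickly with the number of comparisons.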

“The American Statistical Association clearly states that p values should not be used in making clinical decisions – or any sort of decisions – and yet, that’s exactly what journals and registration agencies do,” said Dr. Tannock, who has worked throughout his career to improve the quality and reporting of clinical trials. “They should be using effect size and some measure of value.”

Researchers can conduct a trial in 2,000 patients and identify a difference of a few days in survival, he offered as an example. “It might be statistically significant, but it’s not clinically important. Moreover, those individuals selected for the trial were quite possibly heavily selected to have high performance status,” Dr. Tannock explained. “When you try to see this same small difference in the general patient population, the effect is smaller and the toxicity is higher.”

“While we could conclude falsely that a treatment is effective when actually it is not, there also are examples where there are clearly large benefits but, with a p value just above 0.05, the authors may conclude that there is no effect, and this is plainly wrong,” added Dr. Hackshaw.

When results just miss statistical significance, assessing the evidence requires great care. “We all have different feelings about the data,” Dr. Nowakowski said. “I may see the data as being potentially marginal, while someone else might see a potentially huge benefit. There’s always a degree of subjectivity, and this nuance is often lost in transmission.”

Depending on the study design, trials can be fragile, Dr. Tannock noted. “Sometimes it takes only moving two or three patients from one side to the other, from positive to negative, and you can completely lose the trial’s significance.”
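
One way to put a number on this kind of fragility, sometimes called a fragility index, is to change patients' outcomes one at a time and recount how many changes it takes for the p value to cross 0.05. The sketch below is a simplified version written for illustration: it uses a two-sided Fisher exact test built from the standard library, it always changes outcomes in the arm with the lower event rate, and the trial counts in the example are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Probability of every table that shares the observed margins.
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    p_observed = probs[a]
    # Sum the tables that are as likely as, or less likely than, the observed one.
    return sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))

def fragility_index(events_a: int, n_a: int, events_b: int, n_b: int,
                    alpha: float = 0.05) -> int:
    """Count outcome changes (non-event to event) needed to reach p >= alpha.

    Simplifying assumption: changes are made only in the arm that starts
    with the lower event rate. Returns 0 if the result is not significant.
    """
    change_arm_a = events_a / n_a <= events_b / n_b
    changes = 0
    while fisher_exact_two_sided(events_a, n_a - events_a,
                                 events_b, n_b - events_b) < alpha:
        if change_arm_a:
            events_a += 1
        else:
            events_b += 1
        changes += 1
    return changes

# Hypothetical trial: 10 of 100 events on treatment vs. 25 of 100 on control.
print(fragility_index(10, 100, 25, 100))
```

A small result relative to the number of patients lost to follow-up is a signal that the headline p value deserves extra caution.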

Other points of weakness exist but can go unnoticed by the average reader. For example, the inclusion of multiple comparisons and endpoints increases the likelihood of erroneous inferences. Also, large biases in a trial’s design or conduct might partially or fully explain the observed treatment benefit, and these reveal themselves only after a careful review of an article’s methods section.

Case in Point: Bad Blood

As any trialist can attest, designing, running, and interpreting a trial that produces clinically meaningful and clinically sound results is not easy. There are many opportunities for misinterpretation, as evidenced by the 2016 case of a study of red blood cell transfusions from younger and older donors.

First, JAMA Internal Medicine published a study from a team of Canadian researchers suggesting that red blood cell transfusions from younger donors and from female donors were associated with a statistically significant increase in mortality among recipients.1 Using a time-dependent survival model and data from 30,503 transfusion recipients, they determined that patients who received blood from donors aged 17 to 19.9 years had an 8 percent higher mortality risk than those receiving blood from donors aged 40 to 49.9 years (adjusted hazard ratio [HR] = 1.08; 95% CI 1.06-1.10; p<0.001). Similarly, an 8 percent increase in risk of death was noted for those receiving blood transfusions from female donors compared with male donors (adjusted HR = 1.08; 95% CI 1.06-1.09; p<0.001).

This publication was soon followed by an observational, matched-cohort study published in Blood the same year, wherein investigators found no associations between blood donor age and mortality among 136,639 transfusion recipients.2

“The original researchers assumed that the risk of death and the risk of multiple transfusions were linear, when they really were not,” explained Alan E. Mast, MD, PhD, from the BloodCenter of Wisconsin and a co-chair of the ASH Working Group on Innovations in Clinical Trials. “The data curved because the risk of getting multiple transfusions increased the likelihood of dying, but the risk of getting a different transfusion from a young blood donor increases over that time differently than the risk of dying.”

In light of these discrepant findings, investigators at the Karolinska Institute in Stockholm conducted their own analysis, using methods similar to those in the Canadian study but taking a different approach to control more rigorously for potential confounding variables associated with the total number of units transfused.3

Their findings: Neither donor age nor sex was associated with recipient survival. “Any comparison between common and less common categories of transfusions will inevitably be confounded by the number of transfusions, which drives the probability of receiving the less common blood components,” the authors concluded.

“When you assume linearity between a covariate and the dependent variable, you are essentially averaging out the effect,” Dr. Machado explained. “When people receive multiple transfusions and there is a true nonlinear effect, in a way, you are attributing to each transfusion the average effect of all transfusions.”
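
The confounding described here can be reproduced in a toy simulation. Everything in the sketch below is an assumption made for illustration: the share of units from the less common donor category (10 percent), the range of units per patient, and the mortality formula. It is not the model used in any of the three studies. By construction, donor category has no effect on death; only the number of units does.

```python
import random
from collections import defaultdict

def simulate_patients(n_patients=50_000, p_rare_donor=0.10, seed=1):
    """Return (units, exposed, died) per patient. Donor type never affects death."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_patients):
        units = rng.randint(1, 10)                        # sicker patients get more units
        exposed = any(rng.random() < p_rare_donor for _ in range(units))
        died = rng.random() < 0.05 + 0.03 * units         # risk depends on units only
        rows.append((units, exposed, died))
    return rows

def mortality(rows):
    return sum(died for _, _, died in rows) / len(rows)

def crude_risk_difference(rows):
    """Mortality in exposed minus unexposed, ignoring the number of units."""
    exposed = [r for r in rows if r[1]]
    unexposed = [r for r in rows if not r[1]]
    return mortality(exposed) - mortality(unexposed)

def stratified_risk_difference(rows):
    """Same comparison made within each units stratum, then size-weighted."""
    strata = defaultdict(list)
    for row in rows:
        strata[row[0]].append(row)
    weighted_sum, total = 0.0, 0
    for stratum in strata.values():
        exposed = [r for r in stratum if r[1]]
        unexposed = [r for r in stratum if not r[1]]
        if exposed and unexposed:
            weighted_sum += len(stratum) * (mortality(exposed) - mortality(unexposed))
            total += len(stratum)
    return weighted_sum / total

patients = simulate_patients()
print(f"crude difference:      {crude_risk_difference(patients):+.3f}")
print(f"stratified difference: {stratified_risk_difference(patients):+.3f}")
```

Patients who receive more units are both more likely to die and more likely to have received at least one unit from the less common donor category, so the crude comparison shows an apparent harm that shrinks toward zero once patients are compared only with others who received the same number of units.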

“[This case] is a good example of researchers coming out and asserting something, and their findings got a lot of attention, but when other researchers went back and used different statistical techniques, they found it just wasn’t true,” Dr. Mast added.

Torturing Data Into Confession

Misinterpretation of data can typically be attributed to eagerness to transmit findings. “We have many attractive new agents and therapies that we would like to move quickly to the clinic and to patients, so it’s a tough balance to design the most applicable studies that show the true benefit but also to find ways to finalize the study faster and move it to clinical practice more quickly,” said Dr. Nowakowski.

This tension between caution and enthusiasm plays out in the murky waters of subgroup analyses.

It is common practice in clinical trials to see whether treatment effects vary according to specified patient or disease characteristics, but the rigor of subgroup analyses also varies, and most readers aren’t prepared to spot the differences. In the best-case scenarios, subgroup analyses show homogeneity of effect across multiple subgroups. Problems arise when these analyses are used as “fishing expeditions” in trials where no overall treatment effect is found.

“There is no ‘standard’ approach for subgroup analyses,” said Dr. Hackshaw. He suggests that running an interaction test alone – which can compare whether a treatment effect is different between subgroups, such as males and females, for example – is insufficient. A safer practice is to run both an interaction test and a test for heterogeneity, because the latter assesses whether the effects in the subgroups differ from the overall effect.

“Many researchers do multiple subgroup analyses, often encouraged or requested by journals, and few allow for the multiplicity, so chance effects could arise,” he continued. “Requiring both tests to ‘pass’ would strengthen the evidence when faced with multiplicity.”
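
For readers who want to see what an interaction test looks like in practice, here is a minimal sketch of one common form: a z-test on the difference between two subgroup log hazard ratios, with standard errors recovered from the reported 95% confidence intervals. The subgroup numbers are hypothetical, the function names are invented for this example, and the Bonferroni adjustment shown is only one of several ways to allow for multiplicity.

```python
from math import erfc, log, sqrt

def se_from_ci(lower: float, upper: float, z_crit: float = 1.96) -> float:
    """Standard error of a log hazard ratio, recovered from its 95% CI."""
    return (log(upper) - log(lower)) / (2 * z_crit)

def interaction_test(hr_1, ci_1, hr_2, ci_2):
    """z statistic and two-sided p value for a difference between two subgroup HRs."""
    difference = log(hr_1) - log(hr_2)
    se = sqrt(se_from_ci(*ci_1) ** 2 + se_from_ci(*ci_2) ** 2)
    z = difference / se
    p_value = erfc(abs(z) / sqrt(2))   # two-sided tail area of the normal distribution
    return z, p_value

def bonferroni_threshold(n_subgroup_tests: int, alpha: float = 0.05) -> float:
    """Stricter per-test cutoff when several subgroup analyses are run."""
    return alpha / n_subgroup_tests

# Hypothetical subgroups: one looks "significant" on its own, the other does not.
z, p = interaction_test(0.60, (0.40, 0.90), 1.00, (0.70, 1.43))
print(f"interaction z = {z:.2f}, p = {p:.3f}")
print(f"cutoff if 10 subgroup tests were run: {bonferroni_threshold(10):.3f}")
```

The point of the exercise is that one subgroup's confidence interval excluding 1.0 while another's does not is not, by itself, evidence that the treatment works differently in the two groups; the comparison between them has to be tested directly.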

Beyond these numerical evaluations of a subgroup analysis, Dr. Hackshaw said, “there also needs to be biological plausibility and corroborating evidence from independent studies when claiming a subgroup effect.”

Also, by their nature, subgroup analyses are based on smaller numbers of patients and events, running the risk that the balance in baseline characteristics achieved by randomization might be lost.

“The hope is that the subgroups are all consistent, but even failing to show heterogeneity doesn’t prove that it doesn’t exist,” Dr. Tannock commented. “We should beware of subgroup analyses because the trials are powered for the overall effect; subgroup analyses can show random fluctuations that can be highly misleading.” If there is a subgroup of interest, he added, a separate trial should be designed and powered to assess that.

Even when presenters note that subgroup effects are only hypothesis-generating and should not be taken as evidence of benefit, that message can get watered down during dissemination.

Dr. Nowakowski pointed to a recent example of appropriate subgroup analysis reporting from the 2018 ASH Annual Meeting. Principal investigator Anas Younes, MD, from Memorial Sloan Kettering Cancer Center, presented findings from a phase III study that evaluated whether the combination of ibrutinib plus R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, prednisone) was associated with greater efficacy than R-CHOP alone in patients with previously untreated diffuse large B-cell lymphoma (DLBCL).4

The overall findings were negative: The addition of ibrutinib did not improve event-free survival in the entire intent-to-treat population. However, a subgroup analysis showed an interaction between age and treatment effect, with patients younger than 65 years experiencing a “clinically meaningful improvement” in survival outcomes with the addition of ibrutinib. This signal of benefit is exciting because it highlights the possibility of finally improving on the R-CHOP backbone regimen in younger patients with DLBCL.

“In his presentation, Dr. Younes shared the data in an unbiased way and stressed that, although there is benefit in this subset, the subgroup finding is hypothesis-generating only and needs to be validated in a future study before we can consider implementing it in the clinic,” said Dr. Nowakowski.

“At the end of the day, you can torture the data until it confesses and there can be a lot of pressure to overstate things, but it is the principal investigator’s responsibility to present the data accurately and fairly,” he stressed.

In the study abstract, however, this subtlety was less clear. The subgroup benefit was noted in the first sentence of the conclusion and it headlined some of the news coverage aimed at professional audiences.4

Arming Yourself Against Spin

Cases like the retracted Nature paper raise the question of who bears the responsibility for ferreting out data manipulation or misinterpretation – reviewers or readers?

It’s important for clinicians to be able to read research with a critical eye, said Bob Löwenberg, MD, PhD, from Erasmus University Medical Center in Rotterdam, the Netherlands, and the editor-in-chief of Blood. But, he added, recognizing that some lack this experience, the journal will often provide expert comments on papers “to put the data in perspective and discuss the strengths and limitations of the research.”

One simple way to test the veracity of a clinical trial report: Check the protocol. “The analysis should be done according to the plan that was in the registered protocol, which we require to have been registered before the first patient was entered,” he explained.

Despite efforts to present data objectively, spin, which refers to any attempt to misrepresent study findings to influence the interpretation of those findings, is more prevalent than many realize. Examples include overemphasizing a nonsignificant finding or highlighting a secondary endpoint or post hoc finding in a trial that missed its primary endpoint.

(Editor’s note: In our reporting in ASH Clinical News, we are trying to be part of the spin solution. This includes translating complex statistics into meaningful statements, providing context for the findings presented in clinical abstracts and research papers, and identifying when a paper was prepared with outside editorial assistance or by pharmaceutical sponsors. We also do not repeat statements in print or in abstract presentations about “practice-changing findings,” or those that exaggerate clinical impact or minimize significant adverse events. Have any feedback on our reporting? Let us know at ACNeditor@hematology.org.)

Practitioners who lack in-depth statistical training hope to rely on others – like responsible authors, journals, and editors – to weed out specious findings, but a randomized clinical trial has confirmed that readers often aren’t as good at spotting spin as they think they are.

In 2014, Dr. Tannock and colleagues published findings from the SPIIN trial, which showed that abstracts containing spin can fool even experienced readers.5 SPIIN authors randomly assigned 300 clinician-researchers (all of whom were corresponding authors on published oncology trials) to review a sample of published abstracts in their original form with spin or versions that were rewritten to eliminate spin. This included deleting information that could distort the understanding of the trial’s aim, reporting complete results with no wording of judgment, and replacing the author’s conclusion with a standardized conclusion (e.g., “treatment A was not more effective than comparator B in patients with …”).

All abstracts had statistically nonsignificant primary outcomes, but reviewers who read the original abstracts rated the investigational treatment as more beneficial, compared with reviewers who read the rewritten versions (p=0.03; effect size = 0.25).

“To try to minimize the impact of spin and thus biased dissemination of research results, authors should be educated on how to interpret research results,” the SPIIN trialists wrote. “Peer reviewers and journal editors also play an important role; they should systematically check whether the abstract conclusions are consistent with the study results and whether the results reported in the abstract are free from bias.”

“Given that a large percentage of the large trials done today are supported by industry, researchers can feel pressured toward inappropriate reporting, and I think spin remains fairly common across the board,” said Dr. Tannock.

Still, participants who reviewed abstracts written with spin rated the study as less rigorous (p=0.034) and noted an increased interest in reading the full-text article (p=0.029), compared with participants who reviewed the rewritten abstracts. This suggests that, while readers initially can be drawn in by overstated findings and linguistic spin, they are interested in digging deeper into the findings.

“In the end, I don’t think we’re dealing with a whole bunch of people who are trying to game the system,” noted Dr. Mast. “There are a lot of reasons why information might be incorrect or misleading, and that’s why we keep doing science: to keep checking each other and repeating things and seeing what’s really right when it goes out to the real world.”

—By Debra L. Beck

References

  1. Chassé M, Tinmouth A, English SW, et al. Association of blood donor age and sex with recipient survival after red blood cell transfusion. JAMA Intern Med. 2016;176:1307-14.
  2. Vasan SK, Chiesa F, Rostgaard K, et al. Lack of association between blood donor age and survival of transfused patients. Blood. 2016;127:658-61.
  3. Edgren G, Ullum H, Rostgaard K, et al. Association of donor age and sex with survival of patients receiving transfusion. JAMA Intern Med. 2017;177:854-60.
  4. Younes A, Sehn LH, Johnson P, et al. A global, randomized, placebo-controlled, phase 3 study of ibrutinib plus rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone (RCHOP) in patients with previously untreated non-germinal center B-cell-like (GCB) diffuse large B-cell lymphoma (DLBCL). Abstract #784. Presented at the 2018 ASH Annual Meeting, December 3, 2018; San Diego, CA.
  5. Boutron I, Altman DG, Hopewell S, et al. Impact of spin in the abstracts of articles reporting results of randomized controlled trials in the field of cancer: the SPIIN randomized controlled trial. J Clin Oncol. 2014;32:4120-26.

Clinical trials generally measure one of three types of outcomes: They count people (e.g., dead or alive), they take measurements on people (e.g., cholesterol levels), or they measure the time to an event (e.g., overall or progression-free survival).1

Below are short explanations for some commonly seen graphical representations. This information and the graphics have been summarized from “Current and evolving methods to visualize biological data in cancer research,” published in the Journal of the National Cancer Institute.2

Kaplan-Meier survival curves depict time-to-event data graphically and allow a comparison of outcomes in different groups over time. The curves account for censored observations – subjects for whom time-to-event information is not available because of either loss to follow-up or nonoccurrence of the event before the study end. Kaplan-Meier estimates assume that censored individuals have the same prospects of survival as those who continued to be followed and may not account for factors that can influence collection of survival data. Censoring (represented by a tick mark on the graph) also reduces the number of patients still at risk, so the curve becomes less reliable at later time points.
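
The calculation behind each step of the curve is simple enough to sketch. The code below is a bare-bones illustration of the Kaplan-Meier product-limit estimate, without confidence intervals; the follow-up times and outcomes in the example are invented.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time an event occurred.

    events[i] is True if the event was observed at times[i] and False if the
    patient was censored then. Patients censored at a given time are still
    counted as at risk for events at that same time.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = n_censored = 0
        while i < len(data) and data[i][0] == t:   # gather everyone tied at time t
            if data[i][1]:
                n_events += 1
            else:
                n_censored += 1
            i += 1
        if n_events:
            survival *= 1 - n_events / n_at_risk   # step down only when events occur
            curve.append((t, survival))
        n_at_risk -= n_events + n_censored         # censoring shrinks the risk set
    return curve

# Hypothetical follow-up in months; False marks a censored patient.
months = [2, 3, 3, 5, 8, 8, 10]
observed = [True, True, False, True, False, True, False]
for t, s in kaplan_meier(months, observed):
    print(f"month {t}: estimated survival {s:.3f}")
```

Because each drop is computed from the patients still at risk, the same single event produces a much larger step late in follow-up than early on, which is worth remembering when the tail of a curve appears to plunge or separate.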

Forest plots show the relative treatment effect of an intervention between groups within the larger cohort. They consist of several horizontal lines (each representing a 95% confidence interval) with a central symbol in the middle of each line segment (representing a point estimate that is usually the median or mean). The central vertical line represents the null hypothesis, and the direction of the treatment effect depends on which side of the line the central point lies on. Ideally, the graph also contains a second dotted line showing the overall measure of effect. Limitations include difficulty in decoding symbol areas (like the size of the boxes) or using large symbols that obscure other important information on the plot.

Waterfall plots, swimmer plots, and spider plots increasingly are being used in oncology trials because they offer a powerful means of visualizing individual patient responses to treatment.

Waterfall plots are commonly used to display changes in tumor measurement; each vertical bar represents an individual patient and tumor response is shown by the direction of the bars from baseline (progression if the bar goes above zero and regression if it goes below zero). Patients’ responses are typically sorted horizontally by magnitude in descending order, hence the “waterfall” effect. Waterfall plots do not show changes over time and they can lead physicians to infer incorrectly that treatment response and the magnitude of response automatically translate to patient benefit.
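
The data preparation behind a waterfall plot amounts to one percentage and one sort, as this short sketch shows. The patient labels and tumor measurements are hypothetical, and the text bars stand in for a real chart.

```python
def waterfall_order(baseline, best):
    """Percent change from baseline per patient, sorted from worst to best response."""
    changes = {
        patient: 100 * (best[patient] - baseline[patient]) / baseline[patient]
        for patient in baseline
    }
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)

# Hypothetical tumor measurements in millimeters.
baseline_mm = {"P1": 50, "P2": 40, "P3": 60, "P4": 30}
best_mm = {"P1": 60, "P2": 20, "P3": 57, "P4": 3}

ordered = waterfall_order(baseline_mm, best_mm)
for patient, change in ordered:
    bar = "#" * round(abs(change) / 5)   # one mark per 5 percentage points
    print(f"{patient} {change:+6.1f}% {bar}")
```

Note what the sorted list leaves out: when each measurement was taken, how long the response lasted, and whether the patient was better off, which are the limitations described above.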

Swimmer (or swim-lane) plots can show multiple pieces of information about a given dataset in one plot. Again, individual patients are represented as a single bar (this time, horizontal), but swimmer plots show response, duration, and other metrics over time.

Spider plots, named because the way data are visualized creates a pattern resembling the legs of a spider, depict changes in disease burden over time relative to baseline. Each participant is represented by a single line, with a longer plateau of the curve below the baseline indicating a more durable treatment response. Additional information can be included by altering the symbol of each line’s final data point or the line style and color, but these graphs can be difficult to interpret if there are numerous data points.

References

  1. Hackshaw A. (2009). A Concise Guide to Clinical Trials. Oxford: Wiley-Blackwell.
  2. Chia PL, Gedye C, Boutros PC, Wheatley-Price P, John T. Current and evolving methods to visualize biological data in cancer research. JNCI J Natl Cancer Inst. 2016;108: djw031.

In case researchers lack intrinsic motivation to avoid deceptive data reporting, social media and crowdsourced post-publication peer review provide extrinsic motivation. The PubPeer Foundation launched its PubPeer website in 2012 to do just that.1 The foundation’s stated goal is “to improve the quality of scientific research by enabling innovative approaches for community interaction.” Through the website, registered users can post anonymously as “peers” and comment on published scientific research.

The system works. On September 5, 2018, Nature published a paper that reported a new technique for delivering chimeric antigen receptor T-cell therapies in patients with brain cancers. The 27 co-authors claimed they had developed a molecule that allowed T cells to cross the blood-brain barrier and “home in” on brain tumors.2 The results were almost immediately called into question.

On October 25, Nature editors acknowledged that “the reliability of data presented in this manuscript has been the subject of criticisms, which we are currently considering.” Many of those criticisms came from PubPeer, where the paper had amassed more than 50 comments about misleading, mislabeled, and duplicative figures.3 On February 20, 2019, the authors retracted their paper “to correct the scientific literature, due to issues with figure presentation and underlying data.”

References

  1. PubPeer. “About PubPeer.” Accessed March 4, 2019, from https://pubpeer.com/static/about.
  2. Samaha H, Pignata A, Fousek K, et al. A homing system targets therapeutic T cells to brain cancer. Nature. 2018;561:331-7.
  3. Kwon D. Nature retracts paper on delivery system for CAR T immunotherapy. The Scientist. Accessed March 4, 2019, from https://www.the-scientist.com/news-opinion/nature-retracts-paper-on-delivery-system-for-car-t-immunotherapy-65488.
