This document was written by Bradford Perez, then a third-year medical student, in late March or early April 2008. Working in the laboratory of Anil Potti, Perez presented what biostatisticians describe as an excellent critique of the flawed methodology employed by Duke genomics researchers.
I want to address my concerns about how my research year has been in the lab of Dr. Anil Potti. As a student working in this laboratory, I have raised my serious issues with Dr. Potti and also with Dr. Nevins in order to clarify how I might be mistaken. So far, no sincere effort to address these concerns has been made and my concerns have been labeled a “difference of opinion.” I respectfully disagree. In raising these concerns, I have nothing to gain and much to lose.
In fact, in raising these concerns, I have given up the opportunity to be included as an author on at least 4 manuscripts. I have also given up a Merit Award for a poster presentation at this year’s annual ASCO meeting. I have also sacrificed 7 months of my own hard work and relationships that would likely have helped to further my career. Making this decision will make it more difficult for me to gain a residency position in radiation oncology. As a third year medical student, these are all very important things that I have given up. As a result of these circumstances, I am spending another year of my life pursuing a more meaningful research project. The reason that I have made the decision to leave the lab and make these concerns known is because it is important that the work be done right for the sake of our patients and for the field of genomic medicine.
I joined the Potti lab in late August of last year and I cannot tell you how excited I was to have the opportunity to work in a lab that was making so much progress in oncology. The work in the laboratory uses computer models to make predictions of individual cancer patients’ prognosis and sensitivity to currently available chemotherapies. It also works to better understand tumor biology by predicting likelihood of cancer pathway deregulation. Over the course of the last 7 months, I have worked with feverish effort to learn as much as possible regarding the application of genomic technology to clinical decision making in oncology. As soon as I joined the lab, we started laying the groundwork for my own first author publication submitted to the Journal of Clinical Oncology and I found myself (as most students do) often having questions about the best way to proceed. The publication involved applying previously developed predictors to a large number of lung tumor samples from which RNA had been extracted and analyzed to measure gene expression. Our analysis for this project was centered on looking at differences in characteristics of tumor biology and chemosensitivity between males and females with lung cancer. I felt lucky to have a mentor who was there in the lab with me to teach me how to replicate previous success. I believed the daily advice on how to proceed was a blessing and it was helping me to move forward in my work at an amazingly fast rate. As we were finishing up the publication and began writing the manuscript, I discovered the lack of interest in including the details of our analysis. I wondered why it was so important not to include exactly how we performed our analysis. I trusted my mentor because I was constantly reminded that he had done this before and I didn’t know how things worked. We submitted our manuscript with a short, edited methods section and lack of any real description for how we performed our analysis.
I felt relieved to be done with the project, but I found myself concerned regarding why there had been such pushback against including the details of how we performed our analysis. An updated look at previous papers published before I joined the lab showed me that others were also concerned with the methods of our lab’s previous analyses. This in conjunction with my mentor’s desire to not include the details of our analysis was very concerning. I received my own paper back with comments from the editor and 4 reviewers. These reviewers shared some criticisms regarding our findings and were concerned about the lack of even the option to reproduce our findings since we had included none of the predictors, software, or instructions regarding how we performed this analysis. The implication in the paper was that the study was reproducible using publicly available datasets and previously published predictors even though this was not the case. While I still maintained respect for my mentor’s experience, I felt strongly that we needed to include all the details. Ultimately, I decided that I was not comfortable resubmitting the manuscript even with a completely transparent methods section because I believe that we have no way of knowing whether the predictors I was applying were meaningful. In addition to the red flags with regard to lack of transparency that I mentioned already, I would like to share some of the reasons that I find myself very uncomfortable with the work being done in the lab.
When I returned from the holidays after submitting my manuscript, I started work on a new project to develop a radiation sensitivity predictor using methods similar to those previously developed. I realized for the first time how hard it was to actually meet with success in developing my own prediction model. No preplanned method of separation into distinct phenotypes worked very well. After two weeks of fruitless efforts, my mentor encouraged me to turn things over to someone else in the lab and let them develop the predictor for me. I was gladly ready to hand off my frustration with the project but later learned that the methods of predictor development were flawed. Fifty-nine cell line samples with mRNA expression data from NCI-60 with associated radiation sensitivity were split in half to designate sensitive and resistant phenotypes. Then in developing the model, only those samples which fit the model best in cross validation were included. Over half of the original samples were removed. It is very possible that using these methods two samples with very little if any difference in radiation sensitivity could be in separate phenotypic categories. This was an incredibly biased approach which does little more than give the appearance of a successful cross validation. While this predictor has not been published yet, it was another red flag to me that inappropriate methods of predictor development were being implemented.
After this troubling experience, I looked to other predictors which have been developed to learn if in any other circumstances samples were removed for no other reason than that they did not fit the model in cross validation. Other predictors of chemosensitivity were developed by removing samples which did not fit the cross validation results. At times, almost half of the original samples intended to be used in the model are removed. Once again, this is an incredibly biased approach which does little more than give the appearance of a successful cross validation. These predictors are then applied to unknown samples and statements are made about those unknowns despite the fact that in some cases no independent validation at all has been performed.
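The effect of removing samples that do not fit the cross validation can be illustrated with a minimal sketch on synthetic data. This is not the lab's software or data: the sample count, the five random "expression" features, the arbitrary class labels, the nearest-centroid classifier, and the function names (loo_predictions, prune_misfits) are all assumptions chosen for illustration. Because the features are pure noise, any improvement in cross-validation accuracy after pruning comes from the pruning itself.

```python
import random

def loo_predictions(x, y):
    """Leave-one-out nearest-centroid predictions for samples x with labels y."""
    preds = []
    for i in range(len(x)):
        dist = {}
        for cls in (0, 1):
            # Centroid of this class, computed without the held-out sample i.
            rows = [x[j] for j in range(len(x)) if j != i and y[j] == cls]
            centroid = [sum(col) / len(rows) for col in zip(*rows)]
            dist[cls] = sum((a - c) ** 2 for a, c in zip(x[i], centroid))
        preds.append(0 if dist[0] <= dist[1] else 1)
    return preds

def accuracy(x, y):
    """Fraction of samples whose leave-one-out prediction matches the label."""
    preds = loo_predictions(x, y)
    return sum(p == t for p, t in zip(preds, y)) / len(y)

def prune_misfits(x, y):
    """Repeatedly drop the samples that cross validation misclassifies.

    This is the biased step: samples are removed only because they do not
    fit the model. Stops when nothing is misclassified or a class would
    fall below two samples.
    """
    while True:
        preds = loo_predictions(x, y)
        keep = [i for i in range(len(y)) if preds[i] == y[i]]
        counts = [sum(1 for i in keep if y[i] == cls) for cls in (0, 1)]
        if len(keep) == len(y) or min(counts) < 2:
            return x, y
        x, y = [x[i] for i in keep], [y[i] for i in keep]

random.seed(1)
n, n_features = 60, 5                      # hypothetical sizes
x = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n)]
y = [0] * 30 + [1] * 30                    # arbitrary labels: no real signal

first_acc = accuracy(x, y)
kept_x, kept_y = prune_misfits(x, y)
final_acc = accuracy(kept_x, kept_y)

print(f"samples kept: {len(kept_y)} of {n}")
print(f"cross-validation accuracy before pruning: {first_acc:.2f}")
print(f"cross-validation accuracy after pruning:  {final_acc:.2f}")
```

In a sketch like this, the accuracy reported after pruning says nothing about how the model would perform on independent samples, since the retained samples were selected precisely because they agreed with the model.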
A closer look at some of the other methods used in the development of the predictors is also concerning. Applying prior multiple T-tests to specifically filter data being used to develop a predictor is an inappropriate use of the technology as it biases the cross validation to be extremely successful when the T-tests are performed only once before development begins. This bias is so great that accuracy exceeding 90% can be achieved with random samples. I learned this to be true some months ago and raised concerns at that time to my mentor but was once again pressured to understand that this was not inappropriate as long as ‘robust’ independent validation of the model’s accuracy exists. So far, no ‘robust’ independent validation has been performed on any of these predictors and no independent validation at all has been performed on many of these predictors despite the fact that they are being used in descriptive studies.
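The bias from filtering with T-tests once, before cross validation begins, can be reproduced on pure noise. The following is a minimal sketch under stated assumptions, not the lab's code: the data are random numbers, the labels are arbitrary, the sizes (20 samples, 2000 genes, 10 genes kept) are hypothetical, and a simple nearest-centroid classifier with leave-one-out cross validation stands in for whatever model was actually used. It compares filtering on all samples up front with repeating the filter inside each fold using only the training samples.

```python
import random

def t_stat(a, b):
    """Absolute two-sample t statistic (pooled variance) for lists a and b."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    ssa = sum((v - ma) ** 2 for v in a)
    ssb = sum((v - mb) ** 2 for v in b)
    pooled = (ssa + ssb) / (len(a) + len(b) - 2)
    return abs(ma - mb) / (pooled * (1 / len(a) + 1 / len(b))) ** 0.5

def top_genes(data, labels, rows, k):
    """Indices of the k genes with the largest |t|, using only the given rows."""
    scores = []
    for g in range(len(data[0])):
        a = [data[i][g] for i in rows if labels[i] == 0]
        b = [data[i][g] for i in rows if labels[i] == 1]
        scores.append((t_stat(a, b), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:k]]

def predict(data, labels, train, test, genes):
    """Nearest-centroid prediction for one held-out sample."""
    dist = []
    for cls in (0, 1):
        members = [i for i in train if labels[i] == cls]
        centroid = [sum(data[i][g] for i in members) / len(members) for g in genes]
        dist.append(sum((data[test][g] - c) ** 2 for g, c in zip(genes, centroid)))
    return 0 if dist[0] <= dist[1] else 1

def loocv(data, labels, k, select_inside):
    """Leave-one-out accuracy; the gene filter runs per fold or once up front."""
    all_rows = list(range(len(data)))
    fixed = None if select_inside else top_genes(data, labels, all_rows, k)
    correct = 0
    for test in all_rows:
        train = [i for i in all_rows if i != test]
        genes = top_genes(data, labels, train, k) if select_inside else fixed
        correct += predict(data, labels, train, test, genes) == labels[test]
    return correct / len(data)

random.seed(0)
n_samples, n_genes, k = 20, 2000, 10       # hypothetical sizes
data = [[random.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_samples)]
labels = [0] * 10 + [1] * 10               # arbitrary labels: no real signal

biased = loocv(data, labels, k, select_inside=False)
honest = loocv(data, labels, k, select_inside=True)
print(f"filter once on all samples: {biased:.2f}")
print(f"filter inside each fold:    {honest:.2f}")
```

Since the data contain no signal, a large gap between the two numbers reflects only the fact that, in the first case, the held-out sample had already influenced which genes were chosen.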
My efforts in the lab have led me to have concerns about the robustness of these prediction models in different situations. Over time, different versions of software which apply these predictors have been developed. In using some of the different versions of software, I found that my results were drastically different despite the fact that I had been previously told that the different versions of the classifier code yielded almost exactly the same results. The results from the different versions are so drastically different that it is impossible for all versions to be accurate. Publications using different versions have been published and predictions are claimed to be accurate in all circumstances. If a predictor is being applied in a descriptive study or in a clinical trial for any reason, it should be confirmed that the version of software that is being used to apply that predictor yields accurate predictions in independent validation.
A number of other predictors of chemosensitivity were developed and published before I came to join the lab. I applied the previously developed and published Affymetrix U95 based predictors for sensitivity (Potti et al., Nature Medicine, 2006) and found that in some situations there was extremely poor correlation between that predictor and a newly developed predictor for the same chemotherapeutic agent on the U133A platform (Salter et al., PLoS ONE, 2008). This kind of complete disconnect in two predictors that should be predicting the same thing is concerning and yet our lab considers them both to be valid.
Some other predictors which have been developed in the lab claim to predict likelihood of tumor biology deregulation. The publication which reports the development of these predictors was recently accepted for publication in JAMA. The cancer biology predictors were developed by taking gene lists from prominent papers in the literature and using them to generate signatures of tumor biology/microenvironment deregulation. The problem is in the methods used to generate those predictors. A dataset consisting of a conglomerate of cancer cell lines (which we refer to as IJC) was used for each predictor’s development, and an in-house program, Filemerger, was used to bring the gene list of the IJC down to include only the relevant genes for a given predictor. At that point, samples were sorted using hierarchical clustering and then removed one by one and reclustered at each step until two distinct clusters of expression were shown. This step in and of itself biases the model to work successfully in cross validation although an argument could be made that this is acceptable because the gene list is already known to be relevant. The decision regarding how to identify one group of samples as properly regulated and the other as deregulated is where the methods become unclear. There is no way to know if the phenotypes were assigned appropriately, backwards, or if the two groups accurately represent the two phenotypes in question at all.
Since I have been in the lab, I have worked for countless hours to apply what I believed to be valid models to predict chemosensitivity, oncogenic pathway deregulation, and tumor biology. In looking back at previous publications which claim to validate some of these predictors being used today, most validation data is either unavailable, missing clinical data or methodological details so that validation cannot be performed, or even misrepresented. If the validation sets are not accurate on the version of the software being used today, then they should not be used to make predictions of unknown samples.
After an earlier publication which claimed to make extremely accurate predictions of chemosensitivity (Potti et al., Nature Medicine, 2006), I think that it was assumed that it was easy to generate predictors. More recent events have shown that the methods were more complicated and perhaps different than first described. Given the number of errors that have already been found and the contradicting methods for this paper that have been reported, I think it would be worthwhile to attempt to replicate all the findings of that paper (including methods for development AND claimed validations) in an independent manner. More recently, when we’ve met with trouble in predictor development we’ve resorted to applying prior multiple T-tests or simply removing multiple samples from the initial set of phenotypes as we find that they don’t fit the cross validation model. These methods which bias the accuracy of the cross validation are not clearly (if at all) reported in publications and in most situations the accuracy of the cross validation is being used as at least one measure of the validity of a given model. Also concerning is that models are being applied to describe unknown samples in situations where we are not sure that the models accurately predict what is claimed. Finally, the lack of transparency in making validation sets and methods available so that others can confirm the work is concerning.
At this point, I believe that the situation is serious enough that all further analysis should be stopped to evaluate what is known about each predictor and it should be reconsidered which are appropriate to continue using and under what circumstances. By continuing to work in this manner, we are doing a great disservice to ourselves, to the field of genomic medicine, and to our patients. I would argue that at this point nothing should be taken for granted. All claims of predictor validations should be independently and blindly performed. Unfortunately, since validation datasets on the supplementary website have been shown to be misrepresented in multiple situations, those datasets should be obtained from their respective sources through channels that bypass the researchers.
I have had concerns for a while; however, I waited to be absolutely certain that they were grounded before bringing them forward. As I learn more and more about how analysis is performed in our lab, the stress of knowing these problems exist is overwhelming. Once again, I have nothing to gain by raising these concerns. In fact, I have already lost. As a student, I do not claim to understand the best way to go about performing this analysis; however, to this point no one has shared with me why my concerns are inappropriate. I believe that a truly independent third party intimately familiar with methods of genomic predictor development and application would agree that my concerns are worth considering.