The Genomic Data Commons, NCI’s latest big data project, is poised to become a major player in oncology bioinformatics when it opens June 1.
The GDC aims to become oncology’s go-to database for comprehensive, raw genomics information. NCI officials said this sets the GDC apart from other bioinformatics projects, which are vying to play a role in the White House moonshot initiative.
“When the other groups are sharing the data, what they are doing is sharing very derived data that is divorced from the actual data,” said Louis Staudt, director of NCI’s Center for Cancer Genomics. “The GDC is doing something different.”
Staudt and Warren Kibbe, NCI acting deputy director and director of the NCI Center for Biomedical Informatics and Information Technology, spoke with Matthew Ong, a reporter at The Cancer Letter.
Matthew Ong: When did the idea for the Genomic Data Commons come about and what drove the need for NCI to establish the GDC?
Louis Staudt: That was on my watch. I was asked by Harold Varmus to take over the Center for Cancer Genomics and be its first permanent director. I started learning about what I was suddenly in charge of, which was The Cancer Genome Atlas program, and the great raft of data that was being generated and how it was served up to folks.
Traditionally, this kind of data would be maintained by the National Center for Biotechnology Information, which is one of the centers here at NIH. But the size of this project is so large, the amount of data so great, that it exceeded their capacity; they basically didn’t have the ability to take this on.
The other thing that was very clear, when I started learning about how the TCGA operated, was that it was very quickly and effectively prosecuted as a project. Because of that quickness, it was happening in multiple centers in parallel, often using similar but not identical technologies, and the data were analyzed in similar but not identical ways.
Still, we made a lot of progress in the TCGA project, even given those imperfections. But it seemed like we could do a better job of analyzing and presenting the data. We also needed a long-term place for the data because, of course, the National Cancer Institute, as the steward of our public resources, needed to make the data available for as long as it’s still relevant, which is probably going to be a long time.
And what’s more, we need to keep the data freely available, so that cancer researchers can get at it. Public access to TCGA data has paid off in spades because there have been over 1,000 papers published based on the data.
So we had to do something with the data. The data was outstripping the mechanisms we had put together for storing it. I felt that it was necessary to tackle this in a very concerted fashion, working with the best computer scientists to implement uniform, state-of-the-art bioinformatics protocols and generate what we call “harmonized” data. In this way, all the data from the TCGA and from any other genomics project that we would do would be all analyzable in a common framework.
That was the genesis of the GDC. The initial groundwork was probably late 2013, when we were starting to draw up the plans. The other thing that was very exciting at the time, and remains so, is if we made a bioinformatics engine and made all that software available to cancer researchers, could they then start uploading their own data that they have generated? And the carrot would be that they get to use all of the wonderful software that had been constructed during the TCGA project and get the best possible analysis of their data. The stick would be they must share the data, and they must share it openly and publicly. So that’s the second piece that we really didn’t have before.
When we open this up for business in June of this year, we will start taking in data from anyone in the world who has done a well-annotated cancer project that would add value to the whole.
The final thought—quite aspirational and not realistic at the moment, but exciting nonetheless—is based on the fact that we are moving towards more routine genomics for patients diagnosed with cancer, in the course of their care. If that ends up being the case and there is funding to pay for that, then what would it look like if individual cancer patients started donating their cancer genome sequence to this public repository? Then, suddenly, we’d be getting hundreds of thousands of datasets that we didn’t have to fund centrally. And given the great diversity that is the essence of human cancer, one needs data from over 100,000 cases to have sufficient statistical power to discover all the recurrent genetic driver events that cause cancer.
I just want to be clear, this has not been implemented yet, and there are some barriers to implementation, but it’s kind of fun to think about.
So you’re saying that when the GDC is launched, nearly anyone in the world—not just NCI-designated cancer centers—can mine data as well as contribute data, right?
LS: Yes, the GDC will be openly available. For data submission, we will look at the quality of the data and whether it’s a large enough dataset to make a difference, because everything takes time and we’re going to have to help investigators get their data in order. But yes, any sort of well-conducted study that you are writing a paper about, for which you have molecular data and hopefully some clinical outcome data or at least some clinical data, is potentially appropriate for the GDC, meaning that you can use the GDC tools as long as you agree to share your data.
In the TCGA and the other projects NCI has conducted, we have collected clinical data in a very elaborate and fine-grained fashion. Keeping that all straight and serving it up in a way that is easily searchable so that you can find the cases that you are most interested in is a bit of computer science that took some work to get right.
The GDC will store the genomic sequences along with the analysis of the tumors, and also store the enrollment clinical data and treatment data for these patients. The overarching goal is to move towards a knowledge base for cancer, which will develop a clinically useful molecular taxonomy of cancer that will help us evaluate for any particular patient what the most rational course of action is.
What are the incentives for hospitals and cancer centers to come to the NCI and say “Hey, we’ve got really valuable data and we want to put that into the GDC and make it publicly available”? What incentives would drive people to want to do that?
LS: There are probably many. One very practical incentive is that if researchers would like to publish a paper in most good journals nowadays, they must make their data available according to the journal requirements, and typically that is in a public database like the GDC.
Secondly, again in the realm of research, if your data were collected using NIH funds, then we have a new policy called the NIH Genomic Data Sharing Policy that went into effect January of last year [see https://gds.nih.gov and http://www.cancer.gov/grants-training/grants-management/nci-policies/genomic-data]. That says, in no uncertain terms, that you must share genomic data and associated experimental data.
The GDS stipulates that you must consent the patients on your study for the analysis of their samples so that their data can be shared in a controlled access environment. You must follow through on that. And we will probably figure out ways of enforcing that and making sure that people are living up to that policy.
And the final point is that there is an altruistic streak that is gaining momentum, which I have certainly embraced from the early days of genomics. When you generate this very high-dimensional data, and you write a paper telling the world about the most interesting thing you found, you know that you haven’t described all of it and that someone else may have a different perspective. Sharing the data broadly will inevitably lead to more knowledge generation.
I’ve often used other people’s datasets in my own research. I didn’t have to do anything; all I had to do was click a button and the data was on my computer. And that has been very useful in making my own data richer, allowing me to interpret it better. I think that publicly funded data should be made publicly available. Data from tumors and cancer patients is critical to solving problems in cancer, which is what we’re all about. We’re hearing this perspective from many corners, including AACR and ASCO, and NCI is trying to be a leader in data sharing—certainly that’s been loud and clear in many discussions about the moonshot.
Warren Kibbe: From our standpoint, what we’d really like to see is that all the groups out there collecting these kinds of data actually have a place available to them to share it. And along with genomic data, they need to be able to share all the clinical information, the phenotype around the genomic information, in a way that is well-described, well-defined, and accessible for data sharing as well. I think the GDC is a great platform for people to do data sharing, and that’s really what it’s designed to enable.
One of Vice President Joe Biden’s goals for the National Cancer Moonshot Program is breaking down data silos and establishing a central data repository. I know that many are hoping to be that central repository. In that context, what role do you see the GDC playing in the moonshot program and how will it achieve that goal?
LS: We’re really happy that other groups are trying to do this and working through some of the impediments that naturally come up with the sharing of data. I think that, ultimately, if they come up with a way that they’re going to share their data, then I guess they can share it with the GDC and we can help distribute it. The one thing I think is a little special about the GDC is that there will be only one such NCI-supported system. That doesn’t mean there won’t be a lot of other activities, some of which Warren is coordinating—that will be computational genomics to help understand what all the data means.
But, as a data repository, it’s unlikely we’ll be able to fund a large number of long-term, large data repositories. So I think that the permanency that is inherent in what we’re trying to achieve in the GDC would be seen as a real benefit to anyone who’s trying to share data.
So I applaud them for doing it, working through the problems, making discoveries, and publishing them. We’d like the GDC to help maintain the data and make it continuously available.
At a roundtable at Duke, Biden expressed his astonishment at the number of duplicative efforts in oncology bioinformatics. How is the GDC different from others?
LS: There are certainly many exciting bioinformatics approaches to find drivers for cancer and display the data. However, the GDC is unique in many ways, and I’ll tell you an important one. We are keeping all the raw data, and we keep it in a controlled access way that allows researchers who have the permissions and acumen to look at it. What these other projects are doing—full disclosure, I’m on the external advisory committee of the GENIE Project, so I know about that project and I know also about the others—is that they are going to try to share the somatic genetic changes in the tumor. They may not know whether a mutation was in the germline, and often won’t have precise information about the quality of the data that underlies the determination that there is a mutation. One of the things about keeping the raw data is that better and better algorithms and tools in bioinformatics are being developed rapidly. Based on the type of data that comes off of our high-throughput sequencers, it’s not always clear whether or not there is a mutation, but having the raw data will allow us to implement uniform, well-defined analytical pipelines to make the best determination possible.
Largely, the difficulty arises from the incredible complexity of human cancers. Some mutations are only in a minority of the cells within the tumor—let’s say 2 percent or 5 percent of all the tumor cells have a particular mutation. At this frequency, the real data are approaching the noise in the system, so we have to solve a signal vs. noise problem.
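To make that signal-vs.-noise arithmetic concrete, here is a minimal Python sketch; the read counts, variant allele fraction, and per-base error rate are assumed for illustration, not drawn from the GDC pipelines:

```python
import math

# Minimal sketch of the signal-vs-noise problem in mutation calling.
# All numbers are assumed for illustration. Given n reads covering a
# position, k of which show the variant allele, we ask how probable
# k or more variant reads would be from sequencing error alone.
def binom_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_reads, k_variant = 500, 12   # 12/500 reads = 2.4% variant allele fraction
error_rate = 0.005             # assumed per-base sequencing error rate

p_noise = binom_tail(k_variant, n_reads, error_rate)
print(f"P(>= {k_variant} variant reads from error alone) = {p_noise:.2e}")
```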
So when the other groups are sharing the data, what they are typically doing is sharing very derived data that is divorced from the actual data. The GDC is doing something different. We enable researchers to embark on a perpetual cycle of improvement and reanalysis of data, ever increasing in precision and scale. Other projects, at least as currently defined, will only include the results of analyses, and in many cases, we won’t be able to say whether a particular algorithm that was used might have missed something, or incorrectly called a mutation. That is not to denigrate what others are doing—what they’re doing has real value—but I’m just trying to distinguish what we’re doing.
That said, we’re excited about all efforts to discover important associations between variants and clinical responses and would like to offer the GDC as a useful and permanent venue to share the data.
What is the status of conversations between NCI and the vice president’s office, as well as the other bioinformatics groups? What responses have you been receiving from both of those parties?
LS: Everybody likes the GDC. I certainly know that our director, Doug Lowy, has mentioned the GDC on a number of occasions. It is clear that the GDC is going to be part of the solution.
WK: When we’ve talked with the vice president’s office, both the GDC and the cloud pilots have come up, and they’re very interested in how we can really use what we’ve been developing to push the cancer data sharing agenda.
What the vice president’s been saying is we need to make these data available, and discoverable, and we need to learn from all the data that’s been generated across the country. I think that the way we’ve tried to describe the GDC, relative to all the rest of the projects and initiatives that are out there, is, as Lou said, that the GDC has the raw data. Another advantage for the GDC is consistency—all the genomic data that gets submitted to the GDC is run through a consistent analysis pipeline.
Because we get the raw data, we also make sure that it passes a certain QC threshold, and that when it doesn’t, we can annotate that as well. Again, there’s a piece of what the GDC brings to the table that is distinct from, for instance, trying to aggregate all the genomic data or all the panel sequencing data from every hospital in the country.
LS: Although the special sauce of the GDC is having the raw data, if someone wanted to drop, say, 10,000 cases on us with tumor resequencing and just provide the somatic mutation calls, we will gladly make that available and searchable, and use all the tools of the GDC to be able to look at that data in the context of all other GDC data.
The one thing that I think we need to emphasize is that the data that are generated at these centers largely are resequencing data, looking for mutations—typically, that’s all they’re looking for. That is not sufficient to fully describe the molecular nature of a cancer. At a minimum, you also need to describe the activity of the genes as read out by the messenger RNA levels, which gives you part of the phenotype of the cancer cell, as well as chromosomal copy number changes—too many copies of a gene or deletion of a gene—and rearrangements of a gene.
Largely, the panel sequencing that’s being done routinely at the centers is not looking at that sort of multi-modality data, and it’s only by embracing all those data dimensions that we are going to provide a full molecular picture of a tumor, which will be important for patients. It’s not only, “Yes, you have a mutation in a gene for a particular pathway that may make this cancer susceptible to a drug.” If you have high-level amplification or high expression of a gene in the same pathway, that tumor can be just as addicted to that signaling and just as effectively killed by that particular drug.
It’s this multi-platform integrative analysis that is also a special aspect of the GDC.
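As a toy illustration of that multi-platform logic, the sketch below flags a gene’s pathway as activated if the gene is mutated, amplified, or overexpressed; the field names and thresholds are invented for illustration and are not the GDC’s actual criteria:

```python
# Toy sketch of multi-platform integration: a gene can activate a pathway
# through mutation, amplification, or overexpression. Thresholds and field
# names are invented for illustration.
def pathway_activated(gene: dict) -> bool:
    return (
        gene.get("somatic_mutation", False)
        or gene.get("copy_number", 2) >= 6            # assumed amplification cutoff (normal = 2)
        or gene.get("expression_zscore", 0.0) >= 2.0  # assumed overexpression cutoff
    )

# Amplified but unmutated: mutation calls alone would miss this tumor's
# dependence on the pathway.
tumor = {"somatic_mutation": False, "copy_number": 8, "expression_zscore": 1.1}
print(pathway_activated(tumor))  # True
```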
Warren, you mentioned that the vice president is excited about the GDC. Do you know whether the VP’s office has indicated, or in any way said, “The GDC looks like our solution”?
WK: Well, we’d love it if that were true. I’m not sure that the vice president himself has heard about the GDC. We’ve certainly given that information to the vice president’s office.
Also, in the GDC, it’s not just the pipeline for analyzing all the data, it’s also how you can find it and visualize it. There are actually a lot of tools that are available to help people make sense of the data they are generating. You asked what’s the secret sauce for getting people to share their data—one part of that, of course, is the Genomic Data Sharing Policy.
If you get NIH funding for generating cancer genomics data, there is an expectation and a requirement that you share those data. When the GDC opens, the requirement will be that for NCI-funded investigators generating genomic data, the data has to go to the GDC.
But, just as important—and this is the carrot side of it—you get access to this incredibly well defined computational infrastructure that has all the computer science behind it that we currently understand and have vetted for analyzing genomics.
That’s a tremendous value for the community, and we’ll see how that plays out. We’ve been getting a tremendous amount of interest from a number of projects—can they deposit their data in the GDC to make it available to the community? That’s exactly what it’s there for.
What is the timeline? When will people be able to see the GDC?
LS: It’s going to open up on June 1 of this year. We’re hoping that we can withstand the surge, and we feel the usual fear and trepidation that goes into opening up a big data system. Something that I’ve appreciated is how complicated this has been from a computer science point of view, but the team is doing a great job and we’re doing lots of user testing with interested parties, so it won’t be just an untested system when it opens up.
Then, what will be relatively untested and involve some growing pains is to learn what it means to take in other people’s data. The only difficulty there is we don’t really know what the data will look like. By definition, everybody could be using a different data provider, and the data will be formatted in different ways. We’ve been working hard at generating controlled vocabularies of how the data must be submitted, but there’s going to be some handholding—that’s going to be part and parcel of what we will need to do to get the data in.
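To illustrate what checking a submission against a controlled vocabulary might look like, here is a toy Python sketch; the field names and allowed terms are invented, and the GDC’s actual data dictionary is far richer:

```python
# Toy sketch of validating a submission record against a controlled
# vocabulary. Field names and allowed values are invented for illustration.
CONTROLLED_VOCAB = {
    "sample_type": {"Primary Tumor", "Blood Derived Normal", "Metastatic"},
    "disease_type": {"Lung Adenocarcinoma", "Glioblastoma Multiforme"},
}

def validate(record: dict) -> list:
    errors = []
    for field, allowed in CONTROLLED_VOCAB.items():
        value = record.get(field)
        if value is None:
            errors.append(f"missing required field: {field}")
        elif value not in allowed:
            errors.append(f"{field}={value!r} is not in the controlled vocabulary")
    return errors

# Spelling and case variants are exactly the ambiguities curation must catch.
print(validate({"sample_type": "primary tumour", "disease_type": "Glioblastoma Multiforme"}))
```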
So we will start the process of accepting outside data on June 1, but I don’t think we’ll be able to get the first data in for a month or two. It will take us that amount of time to figure out exactly how to get it in so it’s fully correct.
WK: It will also depend on how big those projects are, which will influence how long it will take to get the data into the GDC. With TCGA itself, just getting a petabyte or so of data over a 10-gigabit line took several months. While I don’t think there are a whole lot of projects that size out there waiting to deposit data in the GDC, there can be a delay in getting the data in. It’s not something that happens overnight, where you flip the switch and suddenly there’s a whole big dataset there.
Actually, a lot of work goes into importing the data and resolving all those ambiguities in the data. There are always far more questions about the data than it seems like there ever should be, considering that it’s all generated with machines, but it’s there, it’s real, and it does take people paying attention to fix all of it.
Is caBIG relevant to the GDC? Are there any lessons to apply here?
LS: It’s very relevant. Bite off what you can chew. Before we even put out the contract announcement for this, we spent nine months of regular weekly meetings working on developing a statement of work for the GDC. We really specified exactly what we wanted the GDC to accomplish. The GDC development team knows this is a contract and not a grant. We said, “This is what you’re going to deliver,” and they’re delivering it.
There were things that we thought about that seemed too difficult to put into the GDC at the very beginning and so they’re not there, and they will be coming in after June 1. The GDC will continue to develop at a fairly rapid pace over the two years following its opening in June—it will keep improving even after that.
So it was important to think a lot about the scope of what could be accomplished and the resources it would take to do that, and hold ourselves accountable for deliverables along timelines that were realistic but would get the product out in a time that is appropriate for the need. That is, we didn’t want to take more than two years to get something done that was useful, and that’s exactly what the GDC team has had—two years.
By the way, the GDC team has been doing an absolutely great job. GDC development is led by Bob Grossman at the University of Chicago [director of the GDC project, chief research informatics officer in the Biological Sciences Division, and professor of medicine at the University of Chicago], with very valuable contributions to the front end user interface by a team at the Ontario Institute for Cancer Research. We also have a really excellent team at our main NCI contractor, Leidos, who has been managing the GDC development along with all of us at NCI. It’s been team science putting this all together.
The big thing is that we didn’t try to do everything you can do in informatics. We had a very well delineated task in front of us, and we figured out a plan to accomplish it.
WK: The only thing I would add that is crucial and is basic to what Lou does—it’s second nature—is the engagement of the whole research community in this. He was talking about TCGA; TCGA investigators have been really involved in helping to specify what the GDC needs to look like, as have the TARGET teams.
So what’s been phenomenal is that TCGA started as a relatively insular project—nobody really knew how it was going to turn out—but by now, eight years later, it involves a huge community of researchers. They really rely on the TCGA data for many things; they’ve also contributed to it in fundamental ways. That’s not just the folks that have grants and contracts around TCGA, it really is the whole genomics and cancer research community.
That involvement of everyone really makes this project much better, and of course, it’s hard to manage all of that, and Lou’s group really does a tremendous job managing the different stakeholders and making sure what is going into the GDC meets the community’s needs. That’s a really important lesson.
LS: Warren’s right—that’s a lesson learned from TCGA, as well as how I do business in general. Team science is something that should be embraced, and that’s probably a bit of a contrast with the past.
Can you go into more detail about the GDC’s arrangement with the University of Chicago and OICR through Leidos? How are they uniquely qualified to develop and manage the GDC?
LS: Good questions. First of all, this was a competitive process, as it always must be. We got quite a few good applications, and the University of Chicago proposal was deemed by an outside panel to be the best from a number of perspectives. Bob Grossman himself is a national leader in big data and has brought a lot of expertise.
The actual computer science construction of the GDC is unlike any database that has held cancer genomics before. It’s not a standard, typical relational database from Oracle; it’s a different, modern design. It allows much more flexibility in dealing with these data, which have many connections that need to be maintained with one another. So Bob has brought that approach from his computer science background.
The way it works is this: Leidos is our contractor, and they subcontract the work out to the University of Chicago, the primary subcontractor. OICR has been doing a lot of the informatics for the International Cancer Genome Consortium, especially the front end website for browsing the ICGC cancer genomics data. Bob naturally turned to them as a sub-subcontractor for the project to bring in that web development.
OICR has been working with the University of Chicago team to optimize very complicated queries of the enormous amount of data in the GDC, so that, even if it’s not instantaneous, you’re not going to wait too long for an answer. That’s computer science. We needed somebody bringing that to the GDC, with an emphasis on the science part of it.
Some time was spent testing out systems that were purported to be good, and when we tried to deploy them at the petabyte scale, they failed miserably. So we said, “Okay, scrap that, we’ve got to go to a different system.” So there’s been a learning curve during the development of the GDC. That, in a nutshell, is an example of how the team has been doing a great job.
Leidos is obviously critical to a lot of what we do here at NCI. They allow us to develop projects that must necessarily occur over a period of time—this mechanism is ideal for longer-term projects like the GDC, with its initial phase of two years, followed by option years in which the GDC is improved and extended.
I want to commend the actual leads at Leidos that we brought in to help manage it. Developing a complex computer system requires active management: What’s done now? What’s done later? Where are we currently? Are we behind on this task? What priority is most pressing now? The Leidos team has been very important in doing that and has been instrumental to the success.
What is the budget for the GDC for the 2016 fiscal year? Is it funded through the CCG?
LS: I can tell you what it has been. Through the two-year period, it’s been $20 million, and we’re just working on the budget for the out years at the moment.
What progress has the GDC made since the initial announcement in December 2014? Where is the project now?
LS: We’ve made a lot of progress. As Warren mentioned, the data are enormous. Just to get the data in from where it was sitting took months, and there’s the added complexity of fixing certain aspects of data that weren’t quite right. That took a long time. The second big task was mapping the sequencing data to the genome. As you know, with high-throughput sequencing, you get millions upon millions of short DNA sequences from each cancer and from the normal DNA of the same patient. Each of these sequences needs to be placed by an algorithm somewhere in the genome, which is a pretty big territory—it’s 3 billion base pairs.
Even to analyze one sample, it takes hours to a day, so the team at Chicago figured out a way to parallelize that process, allowing them to process data from roughly 11,000 cases from the TCGA, another 1,000 to 2,000 pediatric cases from TARGET, and a smattering of 500 or so cases from various other projects. The genome mapping of all of these data has been successfully completed over several months.
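A minimal sketch of that kind of per-case parallelization, with a stub standing in for the multi-hour alignment step and invented case IDs; in reality each case would be dispatched as a job to a cluster or cloud batch system:

```python
from concurrent.futures import ProcessPoolExecutor

# Conceptual sketch: run many per-case alignment jobs in parallel.
# The align_case stub stands in for an aligner that takes hours per case.
def align_case(case_id: str) -> str:
    # ... run the alignment pipeline for one case's sequencing reads ...
    return f"{case_id}: aligned"

case_ids = [f"CASE-{i:05d}" for i in range(1, 9)]  # the real run covered ~12,500 cases

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for result in pool.map(align_case, case_ids):
            print(result)
```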
By the way, all of the software is becoming standardized—the official term, I learned, is dockerized—which means to a computer scientist that they can just take that code and run it readily on their computer, and it will work. That’s remarkable, because usually a computer system is highly tied to the particular architecture of the operating system. So all of these methods that we’ve developed to do this big data analysis are themselves publicly available and reusable by the community.
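As a hedged sketch of how a dockerized step might be invoked reproducibly, the snippet below shells out to `docker run` with a version-pinned image; the image name, mount path, and pipeline arguments are hypothetical placeholders, not the GDC’s actual tooling:

```python
import subprocess

# Sketch of why a dockerized pipeline step is portable: the same pinned
# container image runs identically on any host with Docker installed.
def run_alignment_step(fastq: str, reference: str, out_bam: str) -> None:
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", "/data:/data",      # expose host data inside the container
            "example/aligner:1.0",    # hypothetical image pinned to a version
            "align", "--ref", reference, "--in", fastq, "--out", out_bam,
        ],
        check=True,  # fail loudly if the containerized step fails
    )

# Example invocation (paths are placeholders):
# run_alignment_step("/data/reads.fastq", "/data/grch38.fa", "/data/out.bam")
```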
One other basic task, that was not at all trivial, was to get all of the very complicated clinical data from our various cancer projects in order and searchable. That was a big curation effort.
The next question is, “Alright, what’s there? What are the genetic abnormalities?” You would think that would be easy, but in fact the science on that is still progressing, and we determined that, at present, there’s no one right answer to whether or not there is a mutation in a cancer. We’re implementing three different mutation callers, all respected, all have their pluses and minuses, and we will make all of them available to the users of the data—that’s also something that’s unique to the GDC.
I don’t want to make you feel it’s all a mess: the majority of the mutations are called equivalently by all three callers, but on the edges, each one of them will call some mutations that are actually real and missed by the other callers. As I mentioned earlier, this is due to the complexity of human cancers. A tumor biopsy does not contain only the malignant cells—there are infiltrating immune cells and other cells that dilute the mutational signal from the malignant cells. Secondly, tumors can have minor subclones that differ genetically from the major clone, but the mutations in the subclones are only found in a very low percentage of the sequencing reads. This percentage can approach the error rate of high-throughput sequencing, so you can be confounded by experimental noise. We have implemented two mutation calling pipelines and are close to finishing the third, and each of them takes quite a lot of time, weeks, to finish processing all the data.
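A small sketch of how calls from three callers can be merged and ranked by agreement; the caller names and variants are invented for illustration, and each call is represented as a (chromosome, position, reference, alternate) tuple:

```python
from collections import Counter

# Sketch of combining three mutation callers' outputs. Caller names and
# variants are invented for illustration.
calls = {
    "caller_a": {("chr12", 25398284, "C", "T"), ("chr7", 55249071, "C", "T")},
    "caller_b": {("chr12", 25398284, "C", "T")},
    "caller_c": {("chr12", 25398284, "C", "T"), ("chr17", 7577120, "G", "A")},
}

# Count how many callers support each variant: full agreement is common,
# but each caller also finds calls the others miss ("on the edges").
support = Counter(v for callset in calls.values() for v in callset)
for variant, n in sorted(support.items(), key=lambda kv: -kv[1]):
    print(f"{variant}: called by {n}/3 callers")
```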
The final big task is to make what we hope will be a really useful, attractive browser for the data that will allow you to identify cases by their clinical attributes or molecular features, focus your analysis on those cases, and download the data to your computer for further analysis if needed. That front-end browsing interface has taken a lot of effort to get right. Equally challenging have been our efforts to optimize the experience of people trying to upload data to the GDC. They have to get their data formatted correctly. How can we make that easier for them to do? That’s been a big part of our work.
It looks like we’re on track to meet our goal, our deadline. This is a work in progress—even after GDC opens, we’re going to be adding features on a monthly basis. Nonetheless, we’ll have a lot of functionality when it opens.
Since many of our readers are faculty at academic cancer centers, could you explain who has access to the data? How accessible and easily usable will the data be? Also, how will NCI maintain the quality of the data?
LS: There are two types of data, controlled access data and open access data, and the type of data differs by each project and depends on how the consent for the patients was set up. In the TCGA project, those mutations that are in the protein-coding segments of the genome and are deemed to be somatically acquired—not in the germline of that patient but only in the tumor—are open access. This means that they will be available to anybody, and GDC will provide browsing tools to help people see where they are in genes, and what types of tumors have which types of mutations.
The second type of access is for controlled access data, which means that you apply for access through the standard NIH mechanism, which is the dbGaP system. For this you typically have to be a qualified scientist at a research institution, provide a plan of what you would like to do with the data, abide by a data use agreement that has been established for the project, and agree, for example, not to redistribute the data on the Internet or do anything else that would violate the privacy restrictions attached to the data. Then, once you have dbGaP approval, you can work with the controlled access data in GDC.
In terms of the GDC experience for cancer researchers, we hope the GDC will be useful when we launch and will steadily improve over time. We have already implemented a browser that has a very menu-driven, clickable way to choose cases based on the stage of disease or anatomical location, age of the patient, or a variety of other clinical characteristics. Or you can do it the other way around and say: “Show me all the cases that have mutant KRAS and their associated clinical characteristics.”
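A query of that kind might look like the following Python sketch, written in the style of a REST interface for cases; the endpoint URL, field names, and filter syntax here are assumptions for illustration, not a guaranteed contract:

```python
import json
import requests

# Illustrative case query in the style of the GDC's REST interface.
# Endpoint, fields, and filter syntax are assumptions for this sketch.
CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"

filters = {
    "op": "in",
    "content": {"field": "project.project_id", "value": ["TCGA-LUAD"]},
}
params = {
    "filters": json.dumps(filters),
    "fields": "case_id,demographic.gender,diagnoses.tumor_stage",
    "size": "5",
}

response = requests.get(CASES_ENDPOINT, params=params)
response.raise_for_status()
for hit in response.json()["data"]["hits"]:
    print(hit.get("case_id"))
```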
There will be some browsing functionality at the get-go. You will be able to find data, download it, and visualize it to an extent, but advanced visualization tools take time to develop and will continue to improve over the next two years.
By the way, we will entertain all sorts of improvements from the community and take suggestions for tools that might belong in the GDC. Warren is in charge of another big project, called the NCI Cancer Genomics Cloud Pilots, which is developing a bunch of tools to work on these data. If some of them look really useful and give us a new view of the data, then we’ll just make them an integral part of the GDC.
Does NCI have any plans right now to interface with other databases? What is NCI’s approach to interoperability, and will the GDC be interoperable?
LS: We’ve been envisioning that we want the GDC to interoperate with other systems as much as possible. It was a condition of the contract that the GDC should maintain awareness of international standards and interoperate to the extent warranted. There is an international group of investigators called the GA4GH—the Global Alliance for Genomics and Health—who have been thinking a lot about how to develop standards for accessing genomic data. Given the complexity of setting standards on a global scale, there is a real appreciation that what we’re developing in the GDC might help drive the conversation in GA4GH as much as anything else.
Right now, there’s not a standard coming from the GA4GH for a lot of the types of data that we’re implementing in the GDC, and we think that some of what we have developed will be useful for generating those global standards. We’re not alone in wanting to analyze cancer genomic data: many other countries have made big contributions in this area, and therefore, we need a way to interoperate with them.
For a variety of reasons, it is somewhat unlikely that there will be one large international database of cancer genomics—I think there are some impediments that are difficult to get around—but a virtual way of accomplishing the same goal is something that a lot of people are envisioning. For researchers who have access privileges to data in several repositories, it may be possible for them to send computer programs to the location of data, derive results, and bring them into a common workspace. We would be very interested in helping to support that kind of global sharing of the data.
It seems like a lot of discussion in oncology bioinformatics is moving towards creating a common set of standards for good curation of data. Will GDC become a gold standard for genomics data?
LS: I think that’s a little strong. We will lead by example, and endeavor to make useful tools, all of which will be open source and hopefully easy to implement. If our tools are good, then people will adopt them, and if someone comes up with a radically better tool, we will adopt it and make the GDC better. But, we do hope that we have been at the leading edge of development of informatics for cancer genomics, and that some of what we have done will drive the conversation.
Other groups are doing things that we’re not doing. For example, the ASCO system concerns data from clinical practice and that’s quite different from what we’re going after here.
WK: I think what has changed in the computer science world is that the data structure isn’t the big deal anymore. What matters is how you attach metadata identifiers to the data so you can really tell what it is, what the particular data element is. The data structures you pick are less important than the identifiers.
That’s something that’s important as we try to build interfaces with data from other groups—they’ll be able to expose their data with identifiers, enabling interoperability. True interoperability has been really hard to achieve in the clinical setting but it really seems like we’re on the verge of being able to do that in cancer genomics.
The GDC, as it’s being built, should enable people to explore the data in many different ways and discover what’s inside the GDC, or contribute to it. Every data item in GDC has a rich collection of metadata around it. The ability to add new identifiers around a data element completely changes how you can interoperate between different systems.
The GDC is built in a very different way than anything that NCI’s ever built before and is very different from most cancer genomic systems that are still being built. It is being built on big data principles, meaning that there’s enormous flexibility in being able to take in different kinds of data and have them interoperate with each other.
LS: What it means is that every data element is a separate file and object on the computer, so that there’s no traditional relational database. Every object can be directly related to all the others since, as Warren said, you know exactly what each object is by virtue of its metadata. This is the only way you can deal with the complexity of relationships among GDC data elements. This open data structure allows us, for example, to relate a patient to a treatment regimen, to biopsy samples that came from the patient, to DNA or RNA preparations that came from those samples, and to the several different types of genomic analyses that were performed using that DNA and RNA.
Just by that one sentence you can see how complex the relationships can be within cancer genomics data. The modern data structure that is implemented in the GDC is the only way to deal effectively with these relationships at the scale of two petabytes of data. In other words, you will not have to come back after lunch to get the answer to your GDC data query.
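A conceptual sketch of that object-plus-metadata model follows, with nodes carrying typed metadata and edges carrying named relationships; this illustrates the idea only and is not the GDC’s actual schema:

```python
# Conceptual sketch of "every data element is an object with metadata":
# nodes carry typed metadata, edges carry named relationships.
nodes = {
    "case-01":    {"type": "case", "disease": "lung adenocarcinoma"},
    "sample-01a": {"type": "sample", "sample_type": "Primary Tumor"},
    "aliquot-01": {"type": "aliquot", "analyte": "DNA"},
    "wxs-01":     {"type": "analysis", "workflow": "whole-exome sequencing"},
}
edges = [
    ("case-01", "sample-01a", "has_sample"),
    ("sample-01a", "aliquot-01", "has_aliquot"),
    ("aliquot-01", "wxs-01", "analyzed_by"),
]

def descendants(node_id: str):
    """Walk every object derived from a node, e.g. case -> samples -> analyses."""
    for src, dst, rel in edges:
        if src == node_id:
            yield dst, rel
            yield from descendants(dst)

for dst, rel in descendants("case-01"):
    print(f"{rel} -> {dst} ({nodes[dst]['type']})")
```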
Since this is going to be a permanent database for the long term, where do you see the GDC five, 10, or even 15 years from now? What do you envision this to be?
LS: I think the really cool part is to transform the GDC into an actual knowledge system in which we incorporate not only the raw data from patients’ tumors, but also information that’s relevant for interpreting the functional importance and clinical relevance of the genetic changes in the tumors. I’ll just focus on mutations for simplicity, but the same applies to a lot of other molecular abnormalities. Some mutations are important for the cancer process and contribute to malignancy. These mutations are called drivers. There are a lot of additional mutations in a tumor that are actually just along for the ride and do not contribute to malignancy—they’re called passengers.
We’re at a very rudimentary stage of distinguishing these two types of mutation. Currently, there’s a bit of a growth industry among cancer researchers that entails the testing of mutations identified in large scale projects like TCGA in a variety of different functional assays in order to help determine which ones are drivers and which are passengers. We can put all of that information into the GDC and use it to annotate the genetic data from tumors.
A second major goal is to accrue cases into the GDC that have better clinical outcome data, allowing more informative associations with genomic features. In the TCGA project and in TARGET the primary goal was to perform genomic analysis of a large number of tumors in order to describe the genetic landscape of cancer. A secondary goal in these projects was to get clinical data.
Typically, these data were not collected in a controlled fashion, as would happen in a clinical trial, and the length of follow-up after treatment was not as long as you would like. For this reason, we’re initiating a number of new projects in the NCI Center for Cancer Genomics to perform genomic profiling of tumors from several NCI-sponsored clinical trials. Many of these trials have completely accrued and have mature outcome data, allowing us to ask important questions about which molecular alterations cause tumors to respond well or poorly to particular types of therapy.
By implementing multi-modality genomic characterization and capturing pristine clinical data from the clinical trials, we’re hoping to identify the genetic lesions that dictate response or resistance to treatment, thereby fostering precision oncology.
Very aspirationally, we imagine that the GDC could be used as a reference system for the molecular diagnosis of cancer in a way that influences the care of individual cancer patients. Developing the necessary knowledge base for this will be the focus of research for many years, requiring much more data regarding the relationship between genomic alterations and treatment response. In the future, we imagine that clinical tests could use GDC data and methods to recommend a course of treatment for a patient. Of course, such tests would have to be performed in an appropriate clinical laboratory environment, with approval by regulatory agencies like the FDA.
That’s why we’re making the GDC—to change, for the better, how patients get through their problem with cancer.