NCI is preparing to open the Genomic Data Commons, a $20 million big data endeavor aimed at making raw genomic data publicly available.
The GDC, NCI’s largest bioinformatics effort since the ill-fated caBIG, will go live June 1. The database will be interoperable and publicly available to qualified researchers. Anyone will be able to submit data for consideration.
While work on the GDC began over two years ago, the initiative is being launched at a time when leading oncology groups are positioning themselves to play a central role in the White House’s moonshot initiative.
“The GDC is unique in many ways, and I’ll tell you an important one: we are keeping all the raw data, and we keep it in a controlled access way that allows researchers who have the permissions and acumen to look at it,” said Louis Staudt, director of the NCI Center for Cancer Genomics.
The Cancer Letter had a conversation with Staudt and Warren Kibbe, NCI acting deputy director and director of the NCI Center for Biomedical Informatics and Information Technology, which appears on page 1.
Planning for a large, more permanent data commons began in late 2013, when it became clear to NCI officials that the amount of cancer genomics data was so immense that it exceeded the limits of CBIIT’s data systems.
The GDC is funded as a subcontract awarded by the NCI’s principal contractor, Leidos Biomedical Research Inc. Leidos also runs the institute’s Frederick National Laboratory for Cancer Research.
In 2014, NCI issued a “best value competitive solicitation” for subcontractors and selected the University of Chicago to develop the GDC.
Robert Grossman, chief research informatics officer and professor of medicine at the University of Chicago, is leading the project. Grossman recruited the Ontario Institute for Cancer Research to build the GDC’s user interface.
About 25 reviewers, who made up a source evaluation group, were convened to oversee the acquisition process. NCI officials said the reviewers are experts from government, public, private, commercial, and educational institutions.
“It is Leidos’s practice not to release the detailed steps and names of individual reviewers involved in acquisitions, for reasons of confidentiality,” Staudt said.
The overarching goal of the GDC is to establish a clinically useful central repository of the molecular taxonomy of cancer, a consolidated data portal that will integrate and store the diverse datasets from CCG’s programs.
The database will contain genomic sequences and analyses of tumors, as well as clinical data on enrollment and treatment.
Initially, the GDC will house data from: Cancer Genome Characterization Initiative; The Cancer Genome Atlas; Therapeutically Applicable Research to Generate Effective Treatments, or TARGET; and The Cancer Cell Line Encyclopedia.
Eventually, the GDC will be available as an access point for data from other cancer genomic initiatives. Researchers will be able to use the database to mine information from the GDC and combine it with data from their own research or with data obtained from third parties.
“I think the GDC is a great platform for people to do data sharing and that’s really what it’s designed to enable,” CBIIT’s Kibbe said to The Cancer Letter. “What we’d really like to see is all the groups that are out there collecting these kinds of data, that they actually have a place for them to share it.”
Staudt said the GDC’s ultimate objective is to set standards on a global scale.
“Everybody likes the GDC,” Staudt said to The Cancer Letter. “Given the complexity of setting standards on a global scale, there is a real appreciation that what we’re developing in the GDC might help drive the conversation in the Global Alliance for Genomics and Health as much as anything else. It is clear that the GDC is going to be part of the solution.”
The GDC will be valuable to cancer researchers, said Joyce Niland, chair of the Department of Information Sciences, chief research information officer, and associate director for cancer informatics at City of Hope Comprehensive Cancer Center.
“I think our investigators would use the GDC—it’s useful in that it combines all these different efforts,” Niland said to The Cancer Letter. “I could see us participating in data sharing with the GDC, as long as we have all the right security, safety, confidentiality and permission protocols in place.
“Raw data is highly valuable. That’s what makes the GDC so critical. You can’t do new experimental analyses without the raw data.”
The GDC appears to be a more targeted effort compared to caBIG, said Niland, who is not involved in oversight of the GDC.
“I think lessons were learned from caBIG,” Niland said. “One of the things caBIG was trying to achieve—I don’t think we ever got there completely, it’s very difficult—is to standardize the phenotypic data and the clinical data and to collect that data in an encoded way.
“Genomic data can only be highly valuable and completely useful with the rich phenotypic and clinical data that goes with it—NCI seems to be doing that with the GDC, which is good, and hopefully the extent of these data will be sufficiently rich,” Niland said. “You can detect certain patterns and variations with genomic data, but to truly interpret that variation that defines precision medicine and to work on the moonshot, you really need to know what the clinical data and information are.”
Several groups are involved in overseeing and reviewing the GDC project: the GDC Steering Committee, the GDC Bioinformatics Advisory Group, GDC Subject Matter Experts, Leidos Biomedical Research Inc. team, and NCI leadership.
The GDC Steering Committee serves as the primary oversight body. Composed of members from academia and cancer research centers, it reviews GDC activities and resources and provides guidance.
The GDC Bioinformatics Advisory Group and Subject Matter Experts provide the GDC with advice on bioinformatics pipelines supporting DNA and RNA sequence alignment to the genome, and the generation of higher-level data such as germline variants and somatic mutations, expression levels of messenger RNAs and microRNAs, and DNA copy number alterations.
The Leidos team and members from NCI leadership support the overall management and execution of the GDC.
The names of the individuals involved in providing oversight for the project appear below.
NCI: We’re Doing Something Different
NCI officials say they hope the GDC will help meet the goals of the National Cancer Moonshot Program, a $1 billion initiative led by Vice President Joe Biden.
The moonshot, announced by President Barack Obama Jan. 12, aims to conduct a decade’s worth of cancer research over the next five years—primarily by breaking down data siloes and facilitating the creation of a central bioinformatics database for oncology (The Cancer Letter, Jan. 22).
The administration’s proposal establishes a game plan for spending the funds: the moonshot initiative will begin with $195 million in cancer research at NIH in fiscal 2016, according to the White House.
Though initial funding is relatively modest by comparison with the overall federal spending on biomedical research, the moonshot is shaping up as a broad-based research and public health initiative.
The administration’s budget proposal for the 2017 fiscal year would allocate $755 million in mandatory funds for new cancer-related research activities—$680 million for NIH and $75 million for FDA. The remaining $50 million is expected to fund Centers of Excellence in the Departments of Defense and Veterans Affairs.
“When we’ve been talking about the GDC and the cloud pilots, and both of those come up with the vice president’s office, they’re very interested in how we can really use what we’ve been developing to push the cancer data sharing agenda,” Kibbe said to The Cancer Letter. “What the vice president’s been saying is we need to make these data available, and discoverable, and we need to learn from all the data that’s been generated across the country.”
NCI officials say the GDC is distinct from other prominent initiatives that are on Biden’s radar, including:
• The American Society of Clinical Oncology’s CancerLinQ. Launched in 2010, CancerLinQ is expected to use patient care data from millions of physician and patient records from practices and hospitals to provide feedback and clinical decision support to care providers. When the system is completed, doctors will be able to receive personalized insights based on up-to-date findings (The Cancer Letter, Feb. 20, 2015).
• The American Association for Cancer Research’s Project GENIE, for Genomics, Evidence, Neoplasia, Information, Exchange. The initiative, a multi-phase data-sharing project designed to improve clinical decision making, includes AACR and seven institutions in genomic sequencing.
• ORIEN, the Oncology Research Information Exchange Network, founded by Moffitt Cancer Center and The Ohio State Comprehensive Cancer Center. ORIEN is a self-governed alliance of NCI-designated cancer centers built around a standard consenting and processing protocol called Total Cancer Care (The Cancer Letter, March 13, 2015).
Unlike the GDC, other initiatives primarily limit their data to somatic genetic changes and mutations, Staudt said.
“They may not know whether a mutation was in the germline, and often won’t have precise information about the quality of the data that underlies the determination that there is a mutation,” Staudt said. “One of the benefits of keeping the raw data is that we will be able to implement better and better algorithms and tools in bioinformatics as they are being developed.
“So when the other groups are sharing the data, what they are doing is sharing very derived data that is divorced from the actual data,” Staudt said. “The GDC is doing something different.
“We enable researchers to embark on a perpetual cycle of improvement and reanalysis of data, ever increasing in precision and scale. Other projects, at least as currently defined, will only include the results of analyses, and in many cases, we won’t be able to say whether a particular algorithm that was used might have missed something, or incorrectly called a mutation. That is not trying to denigrate what others are doing—what they’re doing has real value—but I’m just trying to distinguish what we’re doing.
“That said, we’re excited about all efforts to discover important associations between variants and clinical responses and would like to offer the GDC as a useful and permanent venue to share the data.”
Major players in cancer informatics should collaborate, share data and create common standards, Niland said.
“It would be great to bring all these initiatives together,” said Niland. “City of Hope is sharing data with ORIEN, which is matching the clinical data with the tissue data and the genomics that result from the molecular profiling and having the whole package.
“I think ORIEN could contribute to the moonshot as could the GDC, but we all need to come together and use common standards. We need to interoperate and share and integrate across the initiatives, have one contribute to the other. Down the road, that would be ideal.
“We shouldn’t be saying, ‘Oh there are so many standards, I don’t know which one to choose.’ There should be one standard, although this is very difficult to achieve.
“The GDC has a very impressive advisory group here, I hope they would reach out to other initiatives and come to a consensus internationally.”
How the GDC Works
The GDC splits its data into two categories: controlled access and open access.
The open access data includes mutations discovered in TCGA that are in protein coding regions of the genome and were deemed to be somatically acquired, i.e. only in the tumor, not in the germline of the patient.
Controlled access data can only be used by qualified academic researchers, who have to apply for access through dbGaP, the NCBI’s database of Genotypes and Phenotypes. These investigators are required to provide a research plan, abide by a data use agreement, and agree not to redistribute or violate the privacy aspects of the data.
“We have already implemented a browser that has a very menu-driven, clickable way to choose cases based on the stage of disease or anatomical location, age of the patient or a variety of other clinical characteristics,” said Staudt. “Or you will be able to do it the other way around: you can say, ‘Show me all the cases that have mutant KRAS and their associated clinical characteristics.’
“We are implementing three mutation callers in the GDC. The majority of the mutations are called equivalently by all three algorithms, but on the edges, each one of them will call some other mutations that are actually real and missed by the other callers.
“We have already implemented two mutation-calling pipelines and are close to finishing the third pipeline, and each of them takes quite a lot of time—weeks—to finish processing all the data.”
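Staudt's point about the three callers can be illustrated with a minimal sketch in Python. Everything in it is an assumption made for illustration: the caller names, the variant coordinates, and the `compare_callers` function are hypothetical and are not taken from the GDC pipelines.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A variant is identified here by (chromosome, position, reference allele, alternate allele).
Variant = Tuple[str, int, str, str]


def compare_callers(calls: Dict[str, Set[Variant]]) -> Dict[str, Set[Variant]]:
    """Split variants into those reported by every caller and those reported by only some."""
    # Each caller contributes a set, so the count equals the number of supporting callers.
    support = Counter(v for variants in calls.values() for v in variants)
    n_callers = len(calls)
    consensus = {v for v, k in support.items() if k == n_callers}
    partial = {v for v, k in support.items() if k < n_callers}
    return {"consensus": consensus, "partial": partial}


# Hypothetical calls from three hypothetical callers (placeholder coordinates).
calls = {
    "caller_a": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C")},
    "caller_b": {("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "G")},
    "caller_c": {("chr1", 100, "A", "T"), ("chr4", 400, "T", "A")},
}
result = compare_callers(calls)
```

In this toy data, one variant is reported by all three callers, while the others sit "on the edges" and are reported by one or two callers. Keeping both groups, rather than only the intersection, preserves calls that a single algorithm may have gotten right.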
The GDC will be functional by the June 1 launch, but NCI will continue to add features on a monthly basis.
“We will start the process of accepting outside data on June 1, but I don’t think we’ll be able to get the first data in for a month or two,” Staudt said. “It will take us that amount of time to figure out exactly how to get it in so it’s fully correct.
“There were things that we thought about that seemed too difficult to put into the GDC at the very beginning and so they’re not there, and they will be coming in after June 1. The GDC will continue to improve at a fairly rapid pace over the two years following its opening in June—it will keep on improving even after that.”
Interoperability and Past Lessons
The GDC is built to be interoperable, Staudt said.
“We’ve been envisioning that we want to interoperate with other systems as much as possible,” Staudt said. “It was a condition of the contract that the GDC should maintain awareness of international standards and interoperate to the extent warranted.”
The devil is in the details, because it’s the annotations and data identifiers that matter the most, Kibbe said.
“What’s really changed in the computer science world is the data structures you pick are less important than the identifiers. Every data item in GDC has a rich collection of metadata around it,” Kibbe said. “The ability to add new identifiers around a data element completely changes how you can interoperate between different systems.”
That level of granularity is the only way to deal with the complexity of relationships within the data: from a patient to the treatment and the tumor sample, as well as the five different types of analysis that are performed on each DNA and RNA sample.
“What it means is that every data element is a separate file and object on the computer, so that there’s no traditional relational database, so to speak,” Staudt said. “Every object can be directly related with all the others since, as Warren said, you know exactly what each object is by virtue of its metadata.”
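The idea Kibbe and Staudt describe, in which each data element is a separate object known by its identifier and metadata, can be sketched roughly as follows. This is a hypothetical illustration in Python, not the GDC's actual data model: the `DataObject` and `ObjectStore` classes, the field names, and the sample values are all assumptions.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataObject:
    """One data element: a kind, descriptive metadata, and its own identifier."""
    kind: str
    metadata: Dict[str, str]
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    links: List[str] = field(default_factory=list)  # identifiers of related objects


class ObjectStore:
    """Keeps objects by identifier; relationships are links, not table joins."""

    def __init__(self) -> None:
        self._objects: Dict[str, DataObject] = {}

    def add(self, obj: DataObject) -> DataObject:
        self._objects[obj.object_id] = obj
        return obj

    def link(self, a: DataObject, b: DataObject) -> None:
        # Record the relationship on both sides so it can be followed either way.
        a.links.append(b.object_id)
        b.links.append(a.object_id)

    def related(self, obj: DataObject, kind: str) -> List[DataObject]:
        return [self._objects[i] for i in obj.links if self._objects[i].kind == kind]


# Hypothetical objects: a patient, a tumor sample, and two analysis files.
store = ObjectStore()
patient = store.add(DataObject("patient", {"disease": "example disease"}))
sample = store.add(DataObject("sample", {"tissue": "tumor"}))
reads = store.add(DataObject("file", {"data_type": "aligned reads"}))
mutations = store.add(DataObject("file", {"data_type": "somatic mutations"}))

store.link(patient, sample)
store.link(sample, reads)
store.link(sample, mutations)

files_for_sample = store.related(sample, "file")
```

Because every object carries its own identifier and metadata, a new kind of object or a new identifier can be attached later without redesigning a fixed schema, which is the interoperability benefit Kibbe points to.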
The biggest challenge in oncology bioinformatics is annotating phenotypic and clinical data as rigorously as genomic data, Niland said.
“My background is primarily in clinical research and I realized that you can’t do as much as you’d like to do with genomic data without that phenotypic and clinical data,” Niland said. “You can’t make sense of a person’s genomic information if you don’t know exactly what disease they had, what stage, what comorbidities they had, what treatments they received, and what the outcomes were.
“If you really want to interpret the data and find patterns and associations and adjust for covariates and all that, clinical outcomes data etc. needs to be as rich and as standardized as possible. I think that’s the biggest challenge for the moonshot.”
Staudt and Kibbe said that NCI has learned valuable lessons from caBIG.
Launched about 14 years ago, the $350 million bioinformatics venture went beyond its original mission of making it easier for cancer researchers to exchange data and attempted to fulfill two clashing missions: (1) setting the standards for computer tools and (2) promulgating tools that meet those standards.
In 2011, an NCI Board of Scientific Advisors working group found that conflicts of interest—intellectual and organizational—afflicted the ambitious project.
NCI officials said the caBIG fiasco taught the institute to limit its goals and use of resources to solid deliverables and timelines.
“Before we even put out the contract announcement for this, we spent nine months of regular weekly meetings working on developing a statement of work for the GDC,” Staudt said. “We really specified exactly what we wanted the GDC to accomplish. The GDC development team knows this is a contract and not a grant. We said, ‘This is what you’re going to deliver,’ and they’re delivering it.
“The big thing is that we didn’t try to do everything you can do in informatics. We had a very well delineated task in front of us, and we figured out a plan to accomplish it. Bite off what you can chew.”
The teams overseeing and reviewing the GDC project are:
GDC Steering Committee Members
Stephen Chanock, of the NCI Division of Cancer Epidemiology and Genetics
Li Ding, of Washington University in St. Louis
Gaddy Getz, of the Broad Institute at MIT
David Haussler, of the University of California Santa Cruz
Warren Kibbe, of the NCI Center for Biomedical Informatics and Information Technology
Chris Sander, of Memorial Sloan Kettering Cancer Center
Ilya Shmulevich, of the Institute for Systems Biology
Louis Staudt, of the NCI Center for Cancer Genomics
John Weinstein, of MD Anderson Cancer Center
Barbara Wold, of Caltech
Jinghui Zhang, of the St. Jude Children’s Research Hospital
Bioinformatics Advisory Group Members
Barbara Wold, of Caltech
Gaddy Getz, of the Broad Institute at MIT
David Haussler, of UC Santa Cruz
Chris Sander, of Memorial Sloan Kettering
Ilya Shmulevich, of the Institute for Systems Biology
Josh Stuart, of UC Santa Cruz
Subject Matter Experts
Sheila Reynolds, of the Institute for Systems Biology
Jing Zhu, of UC Santa Cruz
Kyle Ellrott, of Oregon Health and Science University
Gordon Saksena, of the Broad Institute at MIT
Ivo Gut, of the Centre Nacional D’analisi Genomica in Barcelona, Spain
Angela Brooks, of UC Santa Cruz
William Lee, of Memorial Sloan Kettering
Katherine Hoadley, of the University of North Carolina