publication date: Mar. 8, 2019
Conversation with The Cancer Letter
Sharpless: With $500 million, NCI can create data federation that would change research in childhood cancer
Norman E. Sharpless
The White House has promised $500 million over 10 years for childhood cancer—whether or not Congress appropriates these funds, it sounds like the pediatric cancer community is in agreement that greater investment is needed in data aggregation and sharing. Are the data needs for childhood cancer being met currently?
This is an area of unmet need. Not solely for childhood cancer, but across all cancers. I think that radical data federation involves multi-level aggregation of data from a variety of sources—genomics, clinical data, radiology, histology.
We don’t really have datasets like that for any population of cancer patients. And I would argue we need it, particularly, if you think about every childhood cancer being a rare cancer, essentially. If cancer is a heterogeneous disease, we really need information on small populations.
I think pediatric cancer’s a great place to start out, because the number of cases is lower—it’s about 16,000 a year. Also, I think there’s tremendous frustration in the pediatric advocacy community that we haven’t been doing a better job of data aggregation and data sharing. And so, there’s a real desire to do more here, and this is a population that is engaged in the issues related to data sharing and data privacy that are important in an effort like this.
And, importantly, it’s hard to do clinical trials and traditional sorts of studies in these populations, where every childhood cancer is a rare cancer. So, you really have to learn from every child with cancer. That’s critical. There’s no luxury of saying, “Well, we can just study part of the population, because it’s so large.” That’s not the case with childhood cancer. I think data aggregation, data federation, is something we need throughout cancer research, but it’s a particularly pressing need in pediatric cancer research.
What are some challenges that are unique to aggregation of data in pediatric cancer?
I think there are a number of challenges to data aggregation in general. There are rules about data sharing and data privacy. There is the issue of groups hoarding data. I think that problem is probably over-advertised. It’s not as bad a problem as maybe some people believe, but it is still a problem.
Probably the biggest challenge about data aggregation in general, and this is not unique to pediatric cancer data, is that it’s just a lot harder to do than you might imagine.
There are a bunch of weedy, complex issues that make sharing data hard. Even when everybody wants to share, and we’re allowed to share, and the consent is proper, and all these kinds of complex issues are okay, getting all the data in a way that you want it, that you can link it to the various sources and abstract from electronic health records and put those clinical data in, and making all of those pieces talk to each other in a way that’s safe and secure, and ensures patient privacy, that’s just really hard to do.
But when you do that, the thing you get out of that effort is greater than the sum of its parts. You get these abilities to see what genetic lesions correlate with what histologic features of the tumor, correlate with what sorts of outcomes in the patients. And so, you really can get a lot more out of the data when it’s aggregated and federated in this way.
You know, we have a demonstration project, if you will, of the utility of big data in cancer research, and that’s The Cancer Genome Atlas. TCGA has been wildly successful from NCI’s point of view.
But that’s just genomic data. It’s been used for thousands of publications for research efforts that we never even imagined that it would be used for—going even beyond cancer research.
And so, the next level of that experiment is if genomic data’s good, what happens when you take genomic data to the 10th power?
So, that’s really the intent of this effort. And we think pediatric cancer’s a great place to start, because the system is already set up to care for these kids in a more networked manner than adult patients, and it’s an unmet need.
If the funding comes through, would NCI and the community be aiming for a clinical-grade database, or a research-grade data commons?
I don’t think I would call it solely a pediatric data commons, because when I think of the cancer research data commons that we’ve been working hard on, including with Moonshot funding, that is a set of datasets.
If we called each one of those datasets a “node,” The Genomic Data Commons is one node.
There are several others—there’s the clinical data, the genomic data, the imaging, cohort data, and other sorts of data. Each one of those nodes can be looked into and searched by a common overarching metadata aggregator that can then pull out the radiology and histology and clinical outcomes and genomics of a specific patient, for example—or specific set of pediatric cancer patients.
I don’t think you would want to create a special little pediatric node that would be walled off and separate from that greater ocean of data, because the problem with that is that it won’t be used to the same extent as that greater ocean of data.
So, rather than create its own special walled-off node, the idea is to make that infrastructure, that framework I described, work better, and then to actually get a lot of the data. We need to sequence the tumors. We need to extract the clinical charts. We need to upload the medical images. We need to get all those data and put them in a place where researchers can use them.
We envision this to be a very high-grade dataset that will be useful for real cutting-edge translational and basic research.
The data would be de-identified, private, and secure, and so, it would be a research-grade dataset to stimulate clinical research in some settings.
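Editor's illustration: the node-and-aggregator arrangement described in this answer can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of NCI's actual systems. The names `Node`, `MetadataAggregator`, and `patient_view`, the patient IDs, and the record labels are all hypothetical. The point it shows is that each node keeps its own data and the aggregator only looks across nodes for a given de-identified patient.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One hypothetical data node (for example genomic, imaging, or clinical)."""
    name: str
    # Maps a de-identified patient ID to this node's records for that patient.
    records: dict = field(default_factory=dict)

    def lookup(self, patient_id):
        # Return this node's records for the patient, or an empty list.
        return self.records.get(patient_id, [])


class MetadataAggregator:
    """Hypothetical index that queries every node instead of copying its data."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def patient_view(self, patient_id):
        # Group whatever each node holds under the node's name; skip empty nodes.
        view = {}
        for node in self.nodes:
            found = node.lookup(patient_id)
            if found:
                view[node.name] = found
        return view


# Made-up example records, for illustration only.
genomic = Node("genomic", {"P001": ["tumor_wgs", "germline_wgs"]})
imaging = Node("imaging", {"P001": ["mri_baseline"], "P002": ["ct_followup"]})
clinical = Node("clinical", {"P002": ["treatment_summary"]})

aggregator = MetadataAggregator([genomic, imaging, clinical])
print(aggregator.patient_view("P001"))
```

In this sketch a pediatric cohort is just a set of patient IDs queried across the shared nodes, which is one way to read the argument against a walled-off pediatric node.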
To make sure I understand this correctly: we’re talking about a broad vision here, but with childhood cancer as an entry point, right?
I think one could argue that if this effort is highly successful for childhood cancer, then we’ll broaden the efforts to other cancers next.
But in a way, it turns the clinical trials framework on its head.
When you have a lot of patients with the same disease, it’s easier to test therapies. And in that setting, complexity is the enemy, right?
You want to have all the patients be as similar as possible and get the same therapy, plus or minus one modest change, to test whether it works. And that’s how we’ve made progress in more common diseases for decades.
But at the other end of the spectrum are rarer cancers and molecularly defined subtypes, and that’s where we’re going in oncology for all kinds of cancer, not just childhood cancer.
As we talk about molecularly defined subtypes that are rarer and rarer, it’s harder to use that traditional clinical trials framework.
What you need to do instead is follow every patient and learn as much about every patient as you can, in this sort of real-world evidence framework.
And then, figure out why they respond by analyzing these sorts of aggregated datasets. We think this is the frontier of cancer research in general, and as I said, pediatric cancer’s the right place to start.
What are some of the existing initiatives that NCI can currently link? Within NCI, or maybe beyond NCI?
That’s a really good question.
I mean, the support that President [Donald Trump] suggests—$500 million over 10 years—is wonderful and appreciated, but that is not enough money to boil the ocean in terms of big data. As I said, big data’s much more expensive than you might imagine. And the NCI has a lot of experience with these data initiatives, and we know what this costs.
So, really, for this to be successful, we have to leverage existing investments and make sure we use the datasets that are already out there and try and link them and get data and pull data from them to get into this common aggregated and federated dataset that lives in the cloud.
So, there are things like the TARGET dataset, the pediatric version of The Cancer Genome Atlas, with 3,400 sequenced cases. That’s genomic data.
There’s the Gabriella Miller Kids First data resource, which has some germline data from patients and first-degree relatives. St. Jude [Children’s Research Hospital] has a lot of patients that are sequenced with some clinical annotation, and so, we’ve been having a lot of discussions with them.
The Children’s Oncology Group has Project:EveryChild that NCI supports. That’s got a lot of stored samples and some clinical annotation, and most of those children were treated on clinical trials.
None of these existing things are perfect. They all have some aspects of the elements we want, but by putting them all together and making them searchable—the vision is that you would just go in as a researcher and look for, say, who with neuroblastoma responds to adriamycin.
And you would know if that was a St. Jude patient, or a COG patient, or wherever the data came from.
We also, by the way, are thinking about how we would collaborate with international stakeholders. There are a lot of other countries that want to do better with their pediatric cancer data. The World Health Organization has a major initiative.
And because the cases in pediatric cancer are rare, getting more data from other countries is a useful thing. There are some challenges unique to global data sharing, but for pediatric cancer, some uses of international data will be important, too, we think.
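Editor's illustration: the search described in this answer, finding who with a given diagnosis responded to a given drug while keeping track of which source each case came from, might look like the Python sketch below. Everything here is an assumption made for illustration: the `CaseRecord` fields, the `find_responders` helper, the placeholder sources `source_a` and `source_b`, and the made-up records. It is not a real schema or real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical, de-identified case record contributed by one source."""
    patient_id: str
    source: str      # which contributing dataset the case came from
    diagnosis: str
    drug: str
    responded: bool


def find_responders(sources, diagnosis, drug):
    """Return sorted (patient_id, source) pairs for matching responders."""
    hits = []
    for records in sources:
        for rec in records:
            if rec.diagnosis == diagnosis and rec.drug == drug and rec.responded:
                hits.append((rec.patient_id, rec.source))
    return sorted(hits)


# Made-up records from two placeholder sources.
source_a = [
    CaseRecord("A-01", "source_a", "neuroblastoma", "adriamycin", True),
    CaseRecord("A-02", "source_a", "neuroblastoma", "adriamycin", False),
]
source_b = [
    CaseRecord("B-07", "source_b", "neuroblastoma", "adriamycin", True),
    CaseRecord("B-09", "source_b", "osteosarcoma", "adriamycin", True),
]

responders = find_responders([source_a, source_b], "neuroblastoma", "adriamycin")
print(responders)
```

Carrying the `source` field on every record is what lets a single query span several contributors without losing where each case originated.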
And might this also be an opportunity for public-private partnership, if the money comes through?
I think everything’s on the table as to how we build this out.
It is unimaginable to me, given the expertise that exists for data analysis and data aggregation in the private sector, that we wouldn’t be relying heavily on industry partners for some aspects—be that as a contractor to help us extract the data from the charts, or as a cloud resource provider to help support some of the systems, or a machine learning company to help do cutting-edge analysis.
I think we will have specific tasks that will require industry partnerships, as well as many academic partnerships, and partnerships with the cancer advocacy community.
I think all of those things are likely to be an important part of this. Once you have the infrastructure built, if, say, you want to get some sequencing data, a separate initiative can provide it.
Companies and organizations that sequence tumors, they can put their data in our dataset. So, once the common structure is there, it allows everybody to contribute data to the sandbox and all things work better.
Where is NCI currently in terms of its capacity to do sequencing? Would it be useful to have really deep genomic sequencing, whole genome and exome sequencing?
We have sequenced a number of patients, and we have access to sequences done by others for a number of patients.
But I think you’re right, some of the money for this effort would also be used to pay for additional sequencing.
But I want to be clear, not most of the money.
This is not TARGET II, a sequencing effort. This is just to fill out some key datasets where we feel like the sequencing data were missing.
I think the kinds of sequencing we would need would be, at a minimum, analysis of DNA. For kids, whole genome is more important, because they have structural variants and other things that are harder to find with whole exome.
I think we’ll need some germline sequencing, and already have a lot of germline sequencing, but we’ll need to do that as well.
But I think, importantly for kids, the tumors tend to have fewer mutations, and often, in certain subtypes, the driver mutation tends to be the same thing over and over again. So, DNA sequencing is not generally enough for this population.
You need some assessment of the epigenetic state of the tumor through RNA sequencing and/or dedicated analysis of chromatin.
So, we think some sequencing will be required. Obviously, we have a lot of sequencing data already that we will use and aggregate in these datasets.
And, of course, other groups will sequence and contribute those data. But it’s likely we’ll continue to need more sequencing, particularly to get at the epigenetics data of the cancer. That’ll be really important.
How would you describe the impact of in-depth genetic analysis in the pediatric space? Have we long ago moved past establishing proof of principle as we know it, and is the impact meaningful and substantive?
I think there have certainly been successes from genomic characterization of pediatric cancer.
One is the appreciation that there are rare responders to pediatric immuno-oncology approaches: these kids with microsatellite instability, the MSI-high tumors.
Usually, pediatric cancers don’t respond to those drugs, but there are rare patients that do, and they’re identified through sequencing.
I think the appreciation of ALK translocations in certain neuroblastomas, and of other kinase targets that were identified for adult cancer and then validated as pediatric targets through sequencing efforts, is another example.
But something that still happens today is, you have drugs that work in kids, where the children respond—in some cases very nice responses—and we don’t know why. It’s not really specified by any DNA mutation.
So, there are patients that respond to a drug like adriamycin or a treatment like radiation therapy, and we can’t predict that solely by analyzing the DNA.
So, there is more molecular information we need about those patients to really predict who’s going to respond—to solve this key question in clinical oncology, this decision problem of, how do you decide what drug to give a patient first?
That is a huge problem not only for kids, but also adults. And we really can’t answer it. Our ability to predict response is still very limited, and as you know, highly empiric.
We treat people for two months, we get a CAT scan and see that the tumor didn’t shrink. That’s the most frustrating thing in the world as an oncologist, to give someone months of ineffective therapy.
So, I think this is an opportunity to try to figure out: What do you need to know about a child’s tumor, what molecular information do you need to know?
Or maybe it’s not just molecular information. Maybe radiology helps. Maybe clinical features help, etc.
What set of information do you need to know to predict what therapy’s going to work best?
Since this is all going to require significant investment—and we have a proposed $50 million a year—if you could submit a budget request, what would be the ideal amount?
As I said, big data’s very expensive. But $50 million a year for 10 years is a significant investment. I mean, that would help a lot.
Certainly, Congress decides the appropriation; were they to give us more, we’d find a use for it. I mean, NCI could always use more support for great cancer research.
I think one part of the portfolio that would be really important is the ability to give some research grants. So, some of the funding would go to research initiatives, both in terms of analyzing data using novel techniques, so funding people to do cool machine learning or artificial intelligence approaches to data analysis.
But also in terms of using these datasets we create for specific purposes, like trying to understand response to therapy, or trying to find a new target for drug development using these federated datasets.
And that grants portfolio could be augmented with additional funds. The more we could spend on investigator-initiated research in the community, the better we do, I think, in cancer research.
So, that’s part of it. I think there are lots of other portions of the data initiative that could be built out based on what Congress decides to appropriate.
This might be a question for Congress as well, but as you know, the STAR Act is authorized to spend $30 million per year over 5 years for the creation of a biorepository. Should that money be separate, or used together with this? What are the chances that the $500 million might come through as new cash?
The STAR Act has some broad direction for HHS, and NCI’s part is focused on biospecimen and survivorship research.
We have already begun implementing the STAR Act with specific funding opportunities in FY 2019 and have some great stories that we’ll be able to talk about as those funding announcements get a little further along, and when some of the new projects are really built out.
And that’s really a great thing. We needed a better survivorship portfolio and better biospecimens collection.
But this effort, this new initiative will then build upon that framework, that foundational work. And really be sort of a force multiplier, if you will, for that effort.
Because, if you think about it, you collect all these biospecimens, but then you need additional money to sequence them and to clinically annotate them, and to get the radiology images, and to put all the data somewhere where people can use it.
That’s why data sharing’s so expensive; just having the piece of tumor is a very early part of the whole analysis. And we need to do everything.
And so, I think the STAR Act is, in some ways, a great taking-off point for this initiative. But I think it’s also important to say that this initiative would not only facilitate and improve survivorship research and biospecimen analysis, but I think it really helps with every area of pediatric cancer research.
Maybe you’re interested in response to therapy or pathogenesis, or in why kids got these cancers to begin with, or in disparity populations within pediatric cancer.
These are all things that are hard to study because pediatric cancers are rare. But a big data initiative allows you to work on almost any area of childhood cancer research, including the laudable goal of advancing survivorship research.
What are the next steps towards making this an effective initiative?
We are already working on these ideas—NCI has a robust portfolio of childhood cancer research and we’re already starting to meet internally and with stakeholders to talk about how we can really facilitate the big data initiative in childhood cancer.
Of course, dedicated funding is important and we won’t know that until the FY 2020 appropriations process is complete.
We plan to convene a meeting asking stakeholders as well as data experts to come to the NCI sometime in the next couple of months to talk about where the opportunities are.
And hopefully as we have a better idea of what type of funding might be possible, the size of the opportunity will come better into focus.
But this is an area the NCI’s really focused on.
As you know, one of my key focus areas when I came here—that I’ve been talking about non-stop for 15 months—has been data, using big data better. And we think pediatric cancer’s a really great place to apply some of those principles.
So, we’re going to do this to some extent, but obviously, new funding from Congress would really be appreciated and speed things along.
Did I miss anything?
Let me say one other very important thing, which is that we’ve never really had a dataset like this.
This quality, this size, this scope doesn’t exist in any area of biomedical research. And so, this is an important first step in learning how useful radical data sharing and aggregation can be.
Therefore, we really expect it to inform not just childhood cancer, but every kind of cancer.
I think that studies done with these multimodal datasets will benefit non-malignant disease, will have implications for things that aren’t even cancer, just the way that The Cancer Genome Atlas has been used for lots of purposes that have nothing to do with cancer.
So, I think these big datasets are very valuable and useful, and I think childhood cancer is the pilot phase. But we envision that what we learn from this effort will be useful well beyond childhood cancer.
And also doing it in a federated way, which would also be a new way of doing things.
Yes, the machine learning community’s really coming to us, and they’re saying, “We can’t use our cool artificial intelligence technology if you just have radiology and no clinical information. Or you just have pathology and no response data. Or you just have the genomics.”
Their modern, cutting-edge analytic tools really work better in these robust multimodal datasets, and I fully agree with that. I think as you start adding these things together, it’s not really just additive, it’s sort of an exponential growth in utility.
I think this effort could be a game changer for childhood cancer patients, and I’m excited about what we can achieve.