Deploying Cloud based Platforms and Analytic Tools to Support Covid 19 and Beyond (Webinar)
Video Transcription
Great. Hello, everybody, and welcome to our webinar today. My name is Helen Burstin. I'm the CEO of the Council of Medical Specialty Societies, and I'm delighted to be your host for today. This is the fourth in our series of six webinars that CMSS is putting on in collaboration with the AAMC, with support from the Gordon and Betty Moore Foundation, on advancing clinical registries to support pandemic treatment and response. The goal of this series is really to identify how the field can transform through rapid-cycle learning and development, using COVID as the immediate case that we're all very invested in investigating at the moment. I encourage you to take a look at our website, view recordings of all the past webinars that are available there, and then register for the upcoming events that I'll share with you at the end of the webinar. I invite you to tweet about the conversation using the hashtag COVIDRegistries. Follow us at CMSS and CMSS.org for more information. Just a few housekeeping pieces of information. We'll do a Q&A after the presentations are done. You can submit your questions via the chat box and the question box on the right-hand side of your screen. This is all being recorded and will be posted on the website, so you can share it with colleagues afterwards, and you should be able to access a PDF of the presentation slides as well. At the conclusion of this webinar, like all our webinars, you'll get a short evaluation. On this webinar, we're going to hear from a remarkable group helping us look at how we can deploy cloud-based platforms and analytic tools to support COVID-19 and beyond. This really builds on the work of our prior webinars. On July 8th, we had a webinar led by Atul Butte on the potential to accelerate availability of and access to electronic data sources within registries, and on thinking about how data can facilitate real-world evidence.
Then on July 17th, we had a second webinar that looked at three different clinical registries addressing COVID, with an assessment from Michael Howell of Google Health of how technology could be leveraged to really move the field forward at a time when it was so important to have standardized data and work across integrated systems and registries. Today really builds on those first two presentations, and we're delighted to have Bill Marks (who unfortunately has some camera issues today, so he's shy) as our moderator. Bill is the head of clinical sciences and the head of neurology for Verily Life Sciences. He's also an adjunct professor of neurology at Stanford University. He will play moderator, introduce each of the panelists, and bring us all back together at the end. So with that, I'll turn it over to Bill. Thanks, Dr. Burstin, and again, sorry for the lack of a video feed, but the important thing is you'll be able to see the presenters and their slides. Well, I'm very delighted to moderate this session on deploying cloud-based platforms and analytic tools to support COVID-19 efforts and beyond. You know, it's really interesting to consider how the capabilities that have been developed in just these past few years in providing large-scale data platforms and powerful data science techniques can be harnessed right now to advance our knowledge of COVID-19. Perhaps equally compelling is the driving use case presented by the pandemic, which really demands accelerated capability development, harmonized approaches, and structures that more than ever facilitate collaboration. Today, we'll hear from three speakers who will provide insights into all these areas, designed to provoke discussion today and, I hope, action moving forward. Our first speaker is David Glazer from Verily Life Sciences.
He'll kick off the program by introducing the principles, practical considerations, and power of large-scale, multimodal, cloud-based data platforms, presenting the experience in the All of Us program as an exemplar. Next, Chris Tremel from the American College of Radiology will discuss his experience in platforms for imaging data, challenges and opportunities around federated registries for collaborative research, and highlight a program focused on imaging in COVID-19. And then finally, Dr. Andrea Ramirez from Vanderbilt will take us back to the All of Us program described by David Glazer, highlighting the process of the real-world scientific research made possible by such platforms, some of the early findings from the All of Us initiative, and how these insights might be applied in the clinic. Well, let's begin. I'm very pleased to introduce my colleague David Glazer, who is engineering director at Verily Life Sciences, where he helps life science organizations use cloud computing to accelerate and scale their work with big data. He's a principal investigator for the Data and Research Center and a member of the steering committee of the NIH All of Us Research Program. David serves on the NIH Advisory Committee to the Director, is co-chair of the Cloud Workstream, and is a member of the steering committee of the Global Alliance for Genomics and Health. He is a founding member of and on the organizing committee for the International Common Disease Alliance. He previously worked at Google, where he founded the Google Genomics team and led a variety of platform product and infrastructure teams. David earned a Bachelor of Science in Physics from MIT. Take it away, David. Thank you, Bill. Thank you, organizers, for the opportunity to be here. I'm looking forward to talking through this and hearing the other presentations. So, I'm going to, as Bill said, talk about, well, I'm going to talk about it if the slides advance. Let's try that again. 
Talk about the platform work that we have done to support doing biomedical analysis in the cloud. Julia, I may need your help. This is not advancing. There we go. So, the overall problem statement here is: how can we take advantage of modern cloud technology to help advance science and medicine? Next slide. When we thought about what problems we needed to solve and how we wanted to design for them, one of the first things we realized is that biomedical data comes in a lot of different shapes and sizes. And we really think about all of these different modalities of data, because the most interesting projects that are happening, and the most interesting opportunities that we know of, involve multiple and often all of these different modalities working together. You'll hear in some of the examples, like All of Us, how it pulls together molecular data, sensor data from things like Fitbits, and survey data, and brings it all together. Next slide. So, when we think about the opportunity of using the cloud and what problems it can help solve in biomedicine, we contrast it with the traditional approach. Before cloud was an option, the state of the art was that people would publish a data set, and then any researcher who wanted to work on it would download a copy of the data to local storage. They'd have to manage their own data, they'd have to manage their own security, they'd have to have their own tools, and it was hard to collaborate. Conversely, what the cloud offers is the ability to have centrally curated large data sets together with centrally curated shareable tools that allow researchers to work in one place: one copy of the data, one place where you have to manage security. And probably most important is that a cloud-centric approach facilitates collaboration. It allows researchers to work together and stand on each other's shoulders as they move forward. Next slide, please.
One of the other things we realized is that in order to take advantage of the cloud, we didn't want to build this in a closed, monolithic way. We thought the best way to take full advantage of the promise of the cloud for biomedicine was to build it in a modular and interoperable way. And so we have been working with a large community of like-minded organizations, and we build the platform in ways that are open source wherever possible (which is usually everywhere), that are community-based, and that use standards from organizations like the Global Alliance for Genomics and Health to connect the different components into a coherent system. Next slide. So with those principles and those goals in mind of supporting biomedicine in the cloud, what we have put together (and when I say "we" here, I'm talking largely about a collaboration between the Broad Institute and Verily Life Sciences) and are continuing to build is Terra, which is a scalable and secure platform for accessing data, running tools, and collaborating in the cloud to advance biomedical research. Next slide. The core value proposition of Terra is work on integrating data and integrating tools from all the different sources, and then, most importantly, pulling it all together into collaborative workspaces that allow researchers to do science using the tools they want to use on the data they need to analyze. Next slide. In one slide, what does this look like architecturally? You can see reflected on the outside of this diagram the three stakeholders whose lives we are trying to make easier by providing this cloud-based platform. We have data generators, who are the ones making data available to selected audiences; tool developers, who create tools that they would like to get widely used; and then researchers, who pull that all together.
This is built as a platform, in green, on those three pillars of data, tools, and workspaces, and then there's a variety of different front ends that we and others have built on top of the platform. I'll show some examples from All of Us and a little bit from the community workbench to show some of the different ways people are putting these principles to work in advancing science. Next slide, please. Some of the different data that people have made available via Terra and are analyzing via Terra is shown here. You can see there's a wide variety, not only of sources, but of types of data, everything from single-cell data to cancer data to epidemiological data. Some of this data is very widely accessible, the publicly funded data; some of it is proprietary, private data. All of that is available, and most importantly, all of it, if you have permission, is cross-analyzable. Next slide. Similarly, on the tool side, we wanted to make available the widest possible array of tools. These are some of the different tools that people are using in Terra. And next slide: we're happy to see the early adoption of Terra, where at this point there are a couple of thousand researchers who use Terra every month, out of maybe a 20,000-researcher total community. So next slide. That's an overview. The key takeaways are the principles of what we're trying to solve with biomedicine in the cloud and the pillars of data, tools, and researchers. Let me walk briefly through the All of Us Research Program. Next slide. I'm going to go very quickly over the background of the program and the science because Andrea will be talking in more detail about that later. I'll do just enough to provide context for the platform capabilities that we are making available to researchers.
So All of Us, as many of you know, is in the process of enrolling a million or more Americans and collecting many kinds of data that they are donating to the program, and then making this data available as widely as possible using cloud-based tools. Next slide. The way that this is done is that there are many different components of the program all working together to generate the right data, all of which is sent to the Data and Research Center, where we collaborate very closely across Vanderbilt, the Broad, and Verily to build these data and research tools. Next slide. As all the data arrives, the first step, which the teams largely at Vanderbilt and Columbia handle, is to curate the data from all the different sources and pull it together into a single cloud-based curated repository that is then made widely available to researchers. So every researcher doesn't have to do their own data cleaning and harmonizing; the most recent published version is always available. Next slide. We just, what is it now? This is August, so two to three months ago, we released the beta version of the researcher tools. The numbers at the bottom are a little bit stale; you can go to researchallofus.org and see the latest. But there's a growing amount of data being collected, a subset of which is publicly available to researchers at any point. And starting in May, the tool set that we've been talking about has been available. Next slide. There's a set of tools online at researchallofus.org as part of the platform. On the left of this diagram are some of the public tools. Anyone here on this call can go to researchallofus.org and ask things like: how many of the participants who have enrolled so far have been diagnosed with epilepsy? Or how many of the participants who have enrolled so far answered that they smoke more than a certain number of packs per day? You can do some of that at an aggregate level as a member of the public.
Then if you register as a qualified researcher, you get access to the researcher workbench tools shown here on the right. Next slide. What those researcher workbench tools do when you log in is give you access, using the Terra platform, to the different modalities of All of Us data. Next slide. The first thing that people tend to do in the research workbench is define a sub-cohort of the couple hundred thousand (now growing to a million) participants that have the characteristics they care about. We have tools to help you build cohorts on those characteristics, whether they be demographic or self-reported or clinical or genomic. Pick the sub-cohort you want. Next slide. Once you have defined a cohort, you can then choose the columns, the fields that you're interested in analyzing, from all the different data that's been collected. Next slide. You put together the rows of participants and the columns of data types, and you now have a data set, which you can then save and analyze with that button at the bottom right. Next slide. You then take that data set, of just the data you want to work on, and you save it into a workspace. We talked earlier about collaborative workspaces. Next slide. You can see that once you have created your data set and saved it in a workspace, you now have a variety of tools. What I'm going to show is just a couple of screens of using Jupyter Notebooks to do analysis on the data subset that you picked. You can see an example here of someone who created a couple of cohorts and notebooks; they created cohorts around dementia and a couple of notebooks with analysis tools as an example. Next slide. Once you put a notebook into your workspace, you then have access to all the capabilities of Jupyter Notebooks, including R and Python and whatever libraries you want, including visualization. Andrea will talk more about some of the really fascinating work that people have already done with the initial data, as a kind of shakedown cruise for the platform.
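The cohort-to-dataset flow just described (filter participants, pick fields, save the result for analysis) could look roughly like this inside a workbench-style Jupyter notebook. This is an illustrative sketch using pandas; the table and column names are hypothetical stand-ins, not the actual All of Us schema or workbench API.

```python
# Hypothetical sketch of the cohort -> dataset -> analysis flow, as it
# might look inside a workbench Jupyter notebook. Table and column names
# are illustrative, not the real All of Us schema.
import pandas as pd

# Stand-in for a curated participant table from the central repository.
participants = pd.DataFrame({
    "person_id": [1, 2, 3, 4, 5],
    "age": [34, 61, 47, 72, 55],
    "condition": ["dementia", "none", "dementia", "epilepsy", "none"],
    "smoker": [False, True, False, True, False],
})

# Step 1: define a sub-cohort with the characteristics you care about.
cohort = participants[participants["condition"] == "dementia"]

# Step 2: choose the columns (fields) you want to analyze.
dataset = cohort[["person_id", "age", "smoker"]]

# Step 3: the saved dataset is now ready for analysis in the notebook.
print(dataset)
```

The same rows-times-columns idea scales to the real workbench, where the cohort builder and dataset builder generate this kind of query for you.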
I believe this one is showing some patterns of medication over time, so you can see what was prescribed as first line, second line, and so on, and you can visualize that. Next slide. What this then gives you is an end-to-end flow through the researcher workbench. Andrea, I think, has a version of the same slide in hers. We both really enjoy the team that we get to work with. These are some of the many people whose work has led to both the data and the tools built on top of the Terra platform, where All of Us is using the power of the cloud to really advance the possibilities of biomedical analysis. All right, next slide. That was one example. I'm going to do a couple of others very quickly, much quicker than that one. So next slide. UK Biobank is another very large, 500,000-person cohort. There's a different front end built on the Terra platform that includes a cohort-building capability, where you can pick subsets of the 500,000 that you want to work on by filtering on any of a number of dimensions, including these. Next slide. Once you've picked a cohort, similarly, you can load that into a workspace. These are some of the workspaces that are available. Next slide. If we go into a workspace, we have the same capabilities, because it's built on the same platform. This is an example of a workspace where we reproduced some published results of an analysis of UK Biobank, as a way of confirming that the platform had the capabilities that were needed. This was a paper that came out a couple of years ago; you see the reference at the top. It does a correlation between the genomic characteristics of people in the cohort and their measured levels of certain proteins in urine. Next slide. In this workspace, what we did to reproduce the work of the paper was pull together the analysis code, including the GWAS code, which you see here. And the next slide then did the visualization.
And as expected, because the paper had already found significance, we were able in Terra to replicate that and show that, yes, indeed, there are some significant hits and significant correlations. The important point about this, back to the idea of collaboration and standing on the shoulders of others, is that because this was all done in the cloud, with shared tools and collaborative workspaces that can be (and in this case were) shared with others: (a) if another researcher wanted to reproduce the results, they can just reproduce them, because they have access to the tools and data; (b) if they wanted to apply the same analysis to their own data, they can do that, because they have the tools; and (c) if they wanted to tweak this analysis a little bit and say, I like most of what you did, but I would have filtered it differently, or I would have weighted it differently, they have full ability to do that, again because they have full access and are picking up right where other researchers have left off. Next slide. The last examples I wanted to give are two quick examples of some of the very recent work that's being done using Terra to advance understanding, research, and insights into COVID. So, on the next slide, we can see an overview of a set of the workspaces. Next slide. Let's look at one of the workspaces that's been published for COVID analysis. This is a workspace published by a lab at the Broad that's doing viral genomics. They took their whole pipeline, all the steps, both secondary and tertiary analysis, that goes from viral sequences hot off the sequencer all the way to phylogenetic trees, to watch the evolution and spread of the virus over time. And they took all of the tools that they used to do that and made them available publicly via Terra, so others could do the same and contribute to the growing understanding of the evolution of the virus. Next slide is another analysis.
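Stepping back to the UK Biobank replication for a moment: the core computation in a GWAS like the one reproduced above is a per-variant association test between genotype and a measured trait. Here is a minimal sketch on simulated data, not the actual workspace code; the effect size, sample size, and variant are all invented for illustration.

```python
# Toy sketch of the per-variant association test behind a GWAS: regress a
# measured trait (e.g. a urine protein level) on genotype dosage at one
# variant. All data here are simulated.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(42)
n = 1000
dosage = rng.integers(0, 3, size=n)        # 0/1/2 copies of the alt allele
trait = 0.5 * dosage + rng.normal(size=n)  # a variant with a true effect

result = linregress(dosage, trait)
print(f"effect size: {result.slope:.3f}, p-value: {result.pvalue:.2e}")

# A real GWAS repeats this for millions of variants and judges hits
# against a genome-wide significance threshold, commonly 5e-8.
significant = result.pvalue < 5e-8
```

The cloud workspace's value is that this test, its inputs, and its visualization (e.g. a Manhattan plot) all live together where others can rerun or tweak them.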
This one was an analysis of differential single-cell RNA expression along a variety of different dimensions, to try to understand susceptibility, severity, and the means of attack of the virus into different cells and different people, and to find the mechanisms and pathways that could potentially lead to interventions. One of the interesting things about this is that on the left you see a snippet of the preprint from bioRxiv, and in the methods section of the paper there's a link to the Terra workspace. It's not just describing the methods; it is the methods. And on the right, you see in the Terra workspace a link back to the paper. This complementarity, being able to have the methods of your publication be live and available, and to have the live and available workspace point back to the published results, is, we think, a lot of the power of cloud platforms for biomedicine. Next slide. So, wrapping up: when we were preparing for this, Bill asked us all to think about what we think the theme of these presentations means for the future. The key theme that I'm seeing, and that I'm excited about, is that taking advantage of modern technology, biomedicine in the cloud, really sets us up for rapid response. So on the next slide, we've talked about a few of these examples. At the beginning, I talked about the benefits of the cloud-centric approach. What that adds up to is a benefit that I didn't really anticipate when we started down this path: this approach also sets us up to be as prepared as possible for the unexpected. In this case, the unexpected being, of course, COVID-19. Because we have collaborative platforms that already have momentum and capabilities in place, it allows fast sharing of new tools, new knowledge, and new data.
As in the example we just showed you with the COVID-19 workspaces, and another example Andrea may be talking about, some surveys that All of Us put out whose data we can quickly make available to the research community. And second, in addition to fast sharing, it allows fast special-purpose analysis. Two examples of many, but two that I'm aware of: when UK Biobank did their internal thinking about how they could help with the global response, they said, well, we already have a whole body of data and tools around our participants, so we can quickly add on a little bit of extra information by surveying our participants, and then make that widely available using the platforms they already have in place, to start to generate some of the first strong evidence that I've seen around the genetic factors affecting COVID susceptibility and severity. Another, similar example in the All of Us world: given the platform that was already in place, All of Us chose to run serology assays on the biosamples that had already been collected, which could work with the other participant data that had already been collected and the tools that were already available. So as things become available, we're again standing on the shoulders of previous giants and standing on the shoulders of colleagues in the field. So thank you for the chance to go through this. I'm going to wrap up, hand it back to Bill, and look forward to taking questions after the three speakers. Thanks so much, David. It's a really great overview of the sophistication of data platforms now available and, as you said, their ability to really enable a rapid research response. I did want to encourage everyone to start thinking about their questions and submitting them for David and for the other speakers. Well, our next speaker is Chris Tremel. Chris is the director of the Data Science Institute at the American College of Radiology.
He works across the college and with other organizations on the advancement of artificial intelligence, standards, best practices, and the ACR's programs and solutions. Chris holds bachelor's and master's degrees in software engineering from the University of Wisconsin. Nice to have you here, Chris. Excellent. Thank you. So, yes, I'm going to talk today about the novel partnerships and research tools that we've experienced, helped develop, and been a part of. I'll just start off, for folks who maybe don't know who we are: the American College of Radiology is a member-based organization in radiology, with about 40,000 members or so. And just to throw some background on this, we have a division called the Center for Research and Innovation that focuses on doing research, with about 140 full-time staff members. And because we are imaging-focused, we process about 2 million-plus images every year through different research projects. A lot of what we've been doing the last couple of years, more and more, is what David was hitting at too, and it reflects what's happening in research: it's not just a single field anymore. It's not just that we're bringing in images and doing analysis on them; it's images plus labs plus demographics. And yes, we're very radiology-focused, but more and more of this is coming together, and so we're doing lots and lots of projects in that area. The last count that I saw is that we've conducted about 500 major research projects in the last several years that we've been doing this. So I'll move on to the first topic here, which is the different operating partnership types that we've seen for projects like this and for the COVID-19 projects we've been seeing crop up. We really classified them into about three major types, and I'll skip over one pretty quickly: single-party, primary/secondary, and federated.
And I lost my mouse there for a moment. There we go. So single-party is a relatively easy one. It's what everyone does by default, because it's simple: you are operating the entire research project yourself. It's the most widely used one because it's the most straightforward, and part of the appeal is that it gives you lots of control: you control it start to finish. The downside, which most folks have experienced, is that it's very difficult to scale or to go into areas outside your expertise. You're typically very limited by the resources you can contribute to the project and how you can grow out. What we are seeing more and more over time is folks going to this next level, which is primary/secondary; I've heard people also call it the subcontractor model. This is where, for instance, someone doing a research project really wants to get data or capabilities from a complementary area of expertise. An example of this is that we recently worked with the PETAL group, the Prevention and Early Treatment of Acute Lung Injury Network, which had a research registry up and going for COVID-19 cases among their current network. They had quite a few participating sites, but their collection was very much clinical, form-based data, and what they wanted to do was start bringing in images as well. Images have their own set of complexities and anonymization requirements behind them. So we worked with them, saying: essentially, we will work with your sites to help you collect the images along with these patients, do the QC behind them, make sure they're valid for what you're looking for, and then help bring them to you so you can fold them back into a larger registry. And as I went over there too, this steps up the complexity a bit, because you have to start thinking about different governance aspects that you just wouldn't have had to consider before.
Even simple things like: when you anonymize the data, is every single data point going to be unique? In that case, the subject ID mapping is not such a big deal. But if you want to make sure you can bring all the data back together down to one patient, you need to start having a reproducible patient-ID-to-subject-ID mapping, and make sure that all the different parties you're working with are following suit. Otherwise, the different parties are going to start collecting data and it's not going to come together the way you want in the end. And one of the models that we're seeing just start to come into play, which we've talked about with folks for a while, is this federated model. I think what we're seeing is that this is probably one of the most complex types of operating principles you can be under. This is really a multi-entity model where the entities run relatively independently from each other. The benefit is you can go to really, really high scale, because you can bring more entities on, and entities can start and stop at different points. But it's extremely high complexity, because you have to really set the rules of the road for how the different organizations that are going to be part of your federated model will work with each other. So what I want to do is dive in depth on an example of one that we're doing right now under this federated model, called MIDRC. And I was lucky enough that this was fully finalized and announced yesterday, so I didn't have to go scrub all my slides and say this was a work in development; we are now fully going forward with it. The idea of the MIDRC project is that it's a virtual registry across multiple different groups. And the reason why we went down this path is that we saw ourselves, the ACR, RSNA, AAPM, TCA, and others all looking to set up COVID-19 research registries around imaging.
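The reproducible patient-ID-to-subject-ID mapping described a moment ago is often built with a keyed hash, so every party that shares the key derives the same subject ID for the same patient without the ID being reversible to the MRN. This is a minimal sketch of that idea, not any registry's actual scheme; the key and ID format are invented.

```python
# Sketch of a reproducible, non-reversible patient-ID -> subject-ID
# mapping using a keyed HMAC. The shared key and "SUBJ-" format are
# illustrative, not an actual registry convention.
import hmac
import hashlib

SHARED_PROJECT_KEY = b"example-key-distributed-to-all-parties"  # hypothetical

def subject_id(patient_id: str) -> str:
    """Derive a deterministic subject ID from a patient identifier."""
    digest = hmac.new(SHARED_PROJECT_KEY, patient_id.encode(), hashlib.sha256)
    return "SUBJ-" + digest.hexdigest()[:12].upper()

# Two parties mapping the same MRN independently get the same subject ID,
# so their data can be joined back to one patient downstream.
assert subject_id("MRN-0012345") == subject_id("MRN-0012345")
assert subject_id("MRN-0012345") != subject_id("MRN-0099999")
```

The design choice here is determinism without a shared lookup table: parties only need the key, and no central service has to hand out IDs.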
And they were a lot alike in terms of what they were asking of sites and what they were trying to accomplish, but they all had different goals for it. For instance, on the ACR side, we were looking at a lot of traditional research aspects, being able to go in and parse the data, so we were looking for very data-rich collection. RSNA, for instance, was very much looking at how to make public data sets for AI training, so they were looking at more mass data and simpler models. And so what we ended up doing is work with all these different groups to say: let's come up with a virtual registry where everyone can do their own aspect of this, but then work together on a set of handshakes for how we're going to make sure our information can be shared outside of just our single registry. So we're not really competing with each other; we're complementing each other across the board. What that ends up meaning is that, from a governance standpoint, we had to work out a lot of different aspects, like what the unifying data elements are (inside of any given independent entity, you might have more data elements or other aspects on top of them). Here's how we're going to do subject ID mapping; here's how we're going to do date-based mapping; here's how we're going to enroll sites, so that we're not enrolling the same site with different parties in a way where it's not clear how it's playing out. And also things like timelines: even though they can all run on different timelines, we make sure that across the board we have a central expectation. And just to show the overall complexity of something like this, as I hit on, I'll show you a quick example. As we were putting this together, this is just one diagram we made of how some of these different parts and pieces move together.
And I won't dive too in depth on it, but this was one part of about a 30- or 40-page document just saying how we were going to run this thing across the board. Each one of these different vertical lines is a different organization participating in the virtual registry, which then goes to this overall public portal gateway that queries and works with the individual ones. But the benefit is that each one of these different slots can have its own secondary priorities; they can have their own goals they're going after on top of this. And more and more can be stood up, or if over time different ones decide they don't want to contribute anymore, they don't have to; the overall project itself continues to live on and grow. Oh, there we go. What also makes this one really complex, and leads me to the next topic I want to hit on, is that with COVID, and with a lot of these different registries and research projects we're going after, there are more and more unknowns. We're trying to say we want to collect more and more data, more and more different types of data, and what that ends up meaning is that the complexity of actually bringing this all together is going to be very high. Some of the things we're collecting inside of MIDRC are different demographic data about the patients; vitals, multiple vitals across the patient, so handling repeatable values and how we want to make sure that happens; labs; images, of course; diagnosis codes; and then various rules where we say we can have multiple labs, but we want to have a single age for the patient, or multiple of this, but a single one of that.
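The single-valued versus repeatable constraints just described can be enforced with a small validation pass before data leaves a site. This sketch invents the element names and rules for illustration; it is not the actual MIDRC data dictionary.

```python
# Sketch of validating repeatable vs. single-valued data elements for one
# patient record. Element names and rules are hypothetical.
REPEATABLE = {"lab_result", "vital_sign", "diagnosis_code"}
SINGLE_VALUED = {"age", "sex"}

def validate(record: dict) -> list[str]:
    """Return a list of validation errors for one patient record."""
    errors = []
    for field, value in record.items():
        if field in SINGLE_VALUED and isinstance(value, list):
            errors.append(f"{field}: expected a single value, got {len(value)}")
        if field in REPEATABLE and not isinstance(value, list):
            errors.append(f"{field}: expected a list of repeated values")
    return errors

record = {
    "age": 63,
    "vital_sign": [{"spo2": 91}, {"spo2": 94}],  # repeatable: OK as a list
    "lab_result": "positive",                    # should be a list of results
}
print(validate(record))  # flags the non-repeated lab_result
```

Running checks like this at each participating entity is one way a virtual registry keeps independently collected data joinable.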
And so what we've found in the past for projects like this, and even more simplistic ones, is that once you can get the data up from the site to a centralized location or a cloud-based location like David was talking about, that's when you can really start doing some really interesting work. But as we've talked to a lot of data scientists, and I've been doing this for a while, where you often see a lot of struggle is that last mile: a site wants to participate, they want to contribute data, but pulling it out of their local systems and actually mapping it over to what the research project needs and requires, so you can unify it all together, becomes a stumbling block for a lot of folks. And so that's where we've been working on a platform we have called ACR Connect. One of the major goals of ACR Connect is that it's a platform that lives inside of your healthcare organization's network. Because it lives inside the network, it can act as a last-mile collection point to help gather up data and bring it up to these larger efforts. So for instance, inside of this example we have MIDRC. Everyone can be doing a different aspect, and they would all have different sites, but we would make sure that we have the Connect node at every single one of our sites, or as many as possible; there are different ways of doing it too. That node helps us connect to local systems and send data up to a lot of our cloud-based solutions, so we can do centralized QC to help the sites, handle storage, and then make the data available through different portals, and along to the greater MIDRC public portal as well.
And so diving in a little bit more on exactly how this ends up working for us: if you look at a particular instance, what we can do locally is talk to your PACS, your EHR, your data warehouse (we're working on integrations with laboratory and reporting systems as well), and talk through various different standards, so DICOM, FHIR, HL7, gathering up as much of the information as possible for the registry. So in this example, with the 40 different data elements, we're helping search those systems, gathering up the images, gathering up the vitals from the EHR, gathering up the background information on the patients, and conglomerating that together for the user so they're not hunting and pecking to find all of that. And then from inside of Connect, we can help with localized mapping and quality control, to make sure everything's looking good, and automatic abstraction. A really big one we can do for folks as well, which we've found really lowers the overhead to participate, is local anonymization. That way, before we're sending things out, we can anonymize the images and the different identifiers, the patient IDs, MRNs, the accession numbers, but also keep that mapping local. So it's not one-way: because that local mapping is stored inside of your firewall for you to access, you can always go back and redo it, and you can always find the patients it maps back to. The goal being that you can do a lot of this work locally to make sure that the payload you're going to send up to these more centralized or cloud-based systems is what they're expecting. That way you can make sure the richness and integrity of the data are as solid as possible, while also trusting the security of what you're sending out the door: you know there's not going to be a PHI slip-up or some aspect that's wrong with it.
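The local-anonymization-with-local-crosswalk idea can be sketched like this. It's a minimal illustration, assuming salted-hash identifiers and a crosswalk that never leaves the site; none of the names here are ACR Connect's actual code:

```python
import hashlib

def anonymize_records(records, salt):
    """Replace MRN identifiers with salted hashes before a payload
    leaves the firewall, and write the original->anonymous mapping into
    a crosswalk that stays local, so the site can always re-identify
    its own patients internally."""
    crosswalk = {}
    payload = []
    for rec in records:
        anon = hashlib.sha256((salt + rec["mrn"]).encode()).hexdigest()[:12]
        crosswalk[rec["mrn"]] = anon
        clean = dict(rec)
        clean["mrn"] = anon
        clean.pop("name", None)      # drop direct identifiers entirely
        payload.append(clean)
    # `payload` is what gets sent up; `crosswalk` never leaves the site
    return payload, crosswalk

payload, mapping = anonymize_records(
    [{"mrn": "12345", "name": "Doe, J", "finding": "ground-glass opacity"}],
    salt="site-local-secret")
```

Keeping the crosswalk on the site side is what makes the process reversible for the contributor but not for the registry, which is the property described above.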
And so then from our angle, Connect would be distributed to every one of our sites. Right now we're in a stage-two rollout of Connect; we started working on it about a year and a half ago, and we're rolling out to an additional 20 to 30 sites right now. The idea is that as that starts happening, more and more of these sites can participate with much lower overhead, contributing as much data as they feel comfortable sending up to these different efforts. Then once that's in place, you can use these very intuitive and powerful tools to collect and work on that unified data as it comes forward. So that was a quick overview of a couple of different things that we're working on and seeing a part of. Oh, and then it goes back to the major one. There we go. So wrapping up, I just want to hit some of the major points. We've seen that you can classify most of these registries into a few different operating modes, single, primary, secondary, or federated, growing in complexity over time. The MIDRC example, I think, is a really good one of how COVID-19, and other research areas like it, presents areas where high scalability is a huge benefit, as opposed to very simplistic models that keep you at a lower scale. And that also leads to ACR Connect and what we've done there to really help solve that last-mile problem, so sites are able to participate and contribute data in a way that doesn't put a large overhead and cost on them. And with that, thank you everyone for attending. That's great, Chris. Thanks so much for these very concrete examples, and again, I encourage those of you listening to be submitting your questions right now to our speakers. And our last speaker is Dr. Andrea Ramirez.
She's an assistant professor of medicine at Vanderbilt University Medical Center in the Department of Endocrinology, Diabetes and Metabolism. Dr. Ramirez is the founding director of the Adult Vanderbilt Genomics and Therapeutics Clinic and the Precision Diabetes Program, which delivers genetic testing directly to patients. Her NIH-funded lab studies monogenic forms of diabetes using multiple approaches, including broadening phenotypic data collection and association with genetic variation. For the All of Us Research Program, Dr. Ramirez is the director of the Data and Research Center's data science team and has led the demonstration projects effort across the consortium, including the approval and execution of over 35 projects by 10 independent teams thus far. She received her bachelor of science from North Carolina State University, earned her doctorate of medicine from Duke, and completed a residency in medicine, a fellowship, and a master of science in clinical investigation at Vanderbilt. Welcome, Dr. Ramirez. Unmute. Sorry about that. Hi, everyone. Great to be with you today. I really appreciated hearing from David and Chris earlier, so I'm happy to be here on behalf of the Data and Research Center. We're excited to share this work. I may, ah, there we go. Okay. So you've heard from David earlier, oh, all my clicks caught up to me, sorry. So the NIH All of Us Research Program, as you've heard, is a historic longitudinal effort to gather data from 1 million or more people. I liked Dr. Collins' quote here, sharing that All of Us is among the most ambitious research efforts our nation has undertaken. I think this quote was made before COVID-19; we might be thinking a little bit differently about it now. So I hope to share with you a little bit about what we've done in All of Us so far, and what that's going to look like going forward in a post-COVID world.
So, again, this is on behalf of this wonderful group of people David and I have both been fortunate to work with. The All of Us Research Program's goal is to accelerate its central mission: to lead health research to medical breakthroughs. In doing so, we hope to catalyze a robust ecosystem, with parties like this group listening, where we'll be delivering the largest, richest biomedical research dataset to date. So in the Research Hub, you can access and analyze All of Us data. As you heard from David, there's a public side, with aggregate data shared in the Data Browser, and then, after a data access approval process, you can move into the Researcher Workbench, which was shown briefly and which we'll talk a little bit more about now, into what we've called the registered tier, which gives you participant-level data, not just the aggregate data on the publicly available side. Again, we hope you all can visit the Research Hub; you saw briefly the Data Browser and some of the other tools available publicly. There's also a research projects directory, so on the public side of the Research Hub, you can see all of the researchers who are working in the Researcher Workbench and what they're working on. So let's talk about that Researcher Workbench a little bit more. We are in a beta launch of this product, and it's important to message to researchers that there is an institutional agreement required right now. We hope to remove some of those access barriers in the future, but right now, we're using an eRA Commons account as an identifier. We're welcoming feedback in this beta phase and excited to hear from users. The tools are evolving all the time, and the program cohort is actively growing, as is the data. So the goal, again, is to be a true game changer for understanding health, using the powerful tools built by Verily and Broad: the Terra platform and the Workbench. So what data are available now?
As you heard in the protocol overview, there is a multi-step process where participants enroll, consent, and authorize sharing of their EHR; they answer surveys; and physical measurements are taken. Those are the three primary data types available now for researchers. Participants also provide biospecimens, which are currently being used for genotyping and DNA exploration; that data is not available yet, but will be coming in the future. Wearables and digital apps have also been deployed to participants, but that data, again, is not yet available in the Researcher Workbench. So how much is there? Completing the survey was the requirement to enter the dataset that's released now in the Workbench: over 224,000 participants have any survey data, over 188,000 of those have had physical measurements taken, and around 50%, or 127,000 or so, have electronic health record data available as well, and you can see the overlap there. We're proud that this is a diverse cohort. All of Us has always intended to engage a very diverse cohort, and has defined cohort diversity in many different categories. Overall, as shown here, 77% of participants in the current dataset identify with at least one underrepresented group, defined in the eight categories below: race, ethnicity, income, age, sexual and gender minority status (which includes sex at birth, gender identity, and sexual orientation), and education level. All of this is described in a paper released on medRxiv earlier this year. We did some deep dives into the data quality of what's available there, aiming to replicate work that had already been done, not make first discoveries that were happening in the community. So what we were looking at was: can we replicate the effects of smoking?
So can we show that people who we know smoke, either from their electronic health record or because they said they smoked in a survey, have the known outcomes and the known impacts of smoking on their health? What you can see there are the odds ratios of the top three increased-risk effects and the top three decreased-risk effects, all of which replicate signals known in the literature and show a consistent effect between the electronic health record and surveys. In a huge new program like this, taking data from so many new sources, replication of these known signals was really important to give us confidence to release the data. So again, this paper is out on medRxiv right now, hopefully coming out in the peer-reviewed literature soon. As David referred to, all of the analyses done by our internal team can be made available on the Workbench through the Terra platform, where all of the work done in Jupyter Notebooks is available as a featured workspace, and any registered researcher can go in and replicate that work and have access to those workspaces. So we were just one team, excuse me. So how can other researchers get access? Again, researchallofus.org is the place to go to apply. I mentioned that an institutional agreement is required; the address is available there, and in the slides that will be made available, if your institution has not applied yet. Over 165 institutions are in the access pipeline right now, and 110 have completed agreements since the beta launch in May. We're really proud that the institutions coming in to work on this dataset are diverse as well, from nonprofit organizations and public universities through to historically Black colleges and universities and disease advocacy groups, so we're excited to see that group growing. So now what? Well, if you're in the Researcher Workbench, you saw some pictures earlier from David's slides on Terra.
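The smoking-replication check described a moment ago boils down to computing odds ratios from 2x2 tables in each data source and comparing them against the literature. A minimal sketch, using Woolf's method for the confidence interval; the counts are made up for illustration, not All of Us data:

```python
import math

def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Odds ratio and 95% confidence interval from a 2x2 table,
    using Woolf's log-OR standard error."""
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - 1.96 * se)
    hi = math.exp(math.log(or_) + 1.96 * se)
    return or_, (lo, hi)

# Hypothetical table: disease cases vs. controls among smokers / never-smokers
or_, ci = odds_ratio(60, 940, 10, 990)
assert ci[0] < or_ < ci[1]
```

Running this once on EHR-derived smoking status and once on survey-derived smoking status, then checking that the intervals agree with each other and with published estimates, is the shape of the replication exercise.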
We hope, and have designed, the Researcher Workbench to contain the tools researchers need to learn about, access, and analyze this participant-level data. What you're seeing here is what your dashboard might look like: workspaces that are available, some tutorials at the bottom, and then a refresher on recently accessed items. Again, we've built in direct-to-researcher user support with an integrated help desk. As part of that user support, we've built out a phenotype library, a phenotype being a collection of observations that describe a condition in a participant. We have taken some already-developed phenotypes and implemented them in reusable notebooks, available for any researcher here to tweak if they need to, or to use off the shelf from the library. Here's an example from that phenotype library, in breast cancer: again, reassuring to see that the majority of the blue circle is female cases, with a small amount of male breast cancer picked up as well, from implementing an already-coded algorithm. And we document these notebooks very heavily in Markdown, so researchers can see our thinking and the process behind building these cohorts, if people do want to change something slightly in the algorithm for their own purposes. Another example of what's provided to researchers when they enter the Workbench is more tutorial workspaces, this one being an example of working with survey data. Survey data in particular, for any of the data scientists out there, can be quite a complex data type to put into a common data model like ours. And so again, we give tips going through this notebook so other people can apply it to any of the survey questions, with the output you see in the bottom right already produced: both plotted counts of responses and a frequency table of answers that they can modify as they see fit.
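The frequency-table output those tutorial notebooks produce can be reproduced with nothing but the standard library. A hypothetical smoking-status question here, not the actual All of Us survey schema:

```python
from collections import Counter

def frequency_table(responses):
    """Counts and percentages for one survey question's answers,
    ordered most-common first, the kind of summary the tutorial
    notebooks emit alongside the plotted counts."""
    counts = Counter(responses)
    total = sum(counts.values())
    return {answer: (n, round(100 * n / total, 1))
            for answer, n in counts.most_common()}

answers = ["Never", "Never", "Daily", "Former", "Never", "Daily"]
table = frequency_table(answers)
# e.g. {"Never": (3, 50.0), "Daily": (2, 33.3), "Former": (1, 16.7)}
```

In practice the notebook pulls the responses out of the common data model first; this sketch only covers the summarization step.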
So in addition to the work we did internally at the Data and Research Center, we thought it was important to also engage the entire consortium. So through the science committee, we formed a demonstration project subcommittee, with the goal to demonstrate scientific utility as the alpha phase of the Workbench's development proceeded, to really inform what that beta Workbench launch looked like. Again, the goal of this group and these projects was to support community use of the data, not make first discoveries, and we did that through partnering with the consortium. Within the larger All of Us Research Consortium, over 10 groups volunteered to participate, and we had a large kickoff meeting in November 2019. Crazy to think of all those people so close together in today's world, but it was a really great meeting, as we were moving out of the DRC-only phase-one replication work into expanding to the consortium and getting ready for beta launch. As you can see, over 10 groups on the left participated, and we covered diverse health areas, building in some gap-addressing projects for areas that weren't proposed by the consortium, to fill in what we thought might have been gaps in covering some of those health areas. I have a few examples here of the projects that went forward. Right now, All of Us is only enrolling participants over the age of 18. This obviously is not ideal for those engaged in pediatric research, and so we asked the question: is there pediatric data in All of Us? If you think about enrolling participants over the age of 18, while you're not gathering prospective data on their childhood, you do have that retrospective look from the electronic health record, and we wanted to quantify what we had captured so far in the Workbench.
Shown here are the overlapping numbers of participants that have data available in the different domains of the common data model, with over 5,000 in the center having an overlap of all the different domain elements within an electronic health record that we identified. So we thought that was a really good pull of pediatric data for a program that wasn't really designed to provide anything for pediatric research at the time. And when we broke it down and asked, well, what's happening to those participants, does this look like reasonable data, and split it into the phases of childhood, what you're seeing, and I apologize for the small print, going from infancy through to late adolescence, is what you would expect to see: the MMR vaccines given in infancy and early childhood; the most frequent event happening across all stages of childhood being an outpatient visit; otitis media being more frequent as a young child; and then inpatient visits being more frequent later. These signals are known, and it was, again, reassuring that this was good data to release to researchers. Another outstanding question was: would data collected by the All of Us program be good for less common diseases? So we used a method developed at Vanderbilt called the phenotype risk score, because we don't have genotype data available yet: can we look at the phenotype and develop a risk score for some of the less common, not quite rare, but less common diseases that have genetic signals, in preparation for that genetic data? And in fact, we did replicate this phenotype risk score approach for three Mendelian diseases, including cystic fibrosis, hereditary hemochromatosis, and sickle cell anemia, showing that those who have cases of those diseases have the phenotypic spectrum captured in their electronic health record consistent with their diagnosis. So this was, again, reassuring that the data we're getting is of high fidelity.
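The core idea behind a phenotype risk score can be stripped down to a few lines: sum inverse-prevalence weights over the disease-associated phenotypes a patient actually has, so rarer features count for more. This is a toy sketch of the concept only, not Vanderbilt's published method, and the feature set and prevalence numbers are invented:

```python
import math

def phenotype_risk_score(patient_phenotypes, disease_features, prevalence):
    """Weighted count of disease-associated phenotypes present in a
    patient's record; each present feature contributes -log10 of its
    population prevalence, so rare features dominate the score."""
    score = 0.0
    for pheno in disease_features:
        if pheno in patient_phenotypes:
            score += -math.log10(prevalence[pheno])
    return score

# Hypothetical cystic-fibrosis-style feature set with made-up prevalences
features = {"bronchiectasis", "pancreatic insufficiency", "failure to thrive"}
prev = {"bronchiectasis": 0.004,
        "pancreatic insufficiency": 0.001,
        "failure to thrive": 0.02}

high = phenotype_risk_score({"bronchiectasis", "pancreatic insufficiency"}, features, prev)
low = phenotype_risk_score({"failure to thrive"}, features, prev)
assert high > low   # more, rarer features -> higher score
```

The replication exercise described above then checks that participants with a confirmed Mendelian diagnosis score higher than the background population.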
And then finally, family health history, something that's becoming increasingly important and recognized. We were gathering it in two ways, from both the electronic health record and surveys. What did the overlap look like, and how could we provide that information to researchers so they could use it in their own discovery efforts? What we found was that in the electronic health record, about 20,000 individuals had family health history available, and from surveys, about 38,000 had family health history available. But was anything new? Were these the same participants? How could researchers use this data? When we looked at the overlap, only about 10% had both electronic-health-record and survey-collected data. So this is a really rich area for researchers to come in and understand what's going on in their disease area of interest, or for other methods development work. And we were excited that, after these demonstration projects, everyone came back for a culmination meeting in January of 2020; again, we were very encouraged by the engagement of the group and their success using the tools, which supported the beta launch available for researchers now. Julia, I'm getting stuck on advancing the slides. Do you mind advancing for me? I'm getting hung up. It's just flashing on the bottom. If you can advance for the next one. Yes, I'm getting the same error. Oh, you're getting the same stuck? Uh-oh. Sorry about that. So here's our data release timeline. Again, the Data Browser launched in 2019, and the Researcher Workbench launched about a year later, in May of 2020. We're currently in a phase of dataset expansion, where we're enrolling more participants, thinking about how to release new data types, and gathering researcher experience from our beta release to inform what the additional data types and tools will look like going forward. So, next slide. But something unexpected happened here.
I thought this was maybe the best expression I've seen on a coronavirus so far. And that really threw a wrench, or an opportunity, into the All of Us program timeline. So I want to talk briefly about what All of Us's response has been. Next slide. Just last June, Dr. Collins laid out how All of Us would join the fight against COVID-19: having built this database and research program, how could we pivot and look at where All of Us could intersect and synergize with other COVID-19 research efforts? Next slide. So importantly, and understandably, because of COVID-19, enrollment did pause in March. It was a monumental effort to take what had been so many different sites enrolling across the country and really pause that work. We also realized that from before the program shut down in mid-March, back to January of 2020, we had about three months of blood specimens that were now stored at the Mayo Clinic Biobank, and the map that you're seeing is where all of those blood samples came from: over 30,000 blood samples gathered in the early months of 2020 from around the country. Having the ability to study those blood samples and find out where antibodies might have been present earlier than community transmission was recognized could be a really powerful way for All of Us to contribute to COVID research, and that study is ongoing right now, using those biospecimens to see where antibody positivity was present early in the spread of the disease. The second effort that All of Us has undertaken is an EHR curation effort. Our typical EHR curation is about a three-to-six-month cycle, but to provide more rapid iteration on data to researchers, we've really been looking at what the proper data types are and how we can gather enough data from over 34 different electronic health records to provide to researchers quickly.
And then thirdly, a new survey was implemented for participants, looking at longitudinal collection on health and well-being. We've recognized that aside from just physical health concerns, All of Us could significantly contribute to understanding the other socio-emotional aspects that are coming in: anxiety, changing routines. And so the surveys undertaken are really aimed at understanding that other well-being aspect of what may be happening to these participants during the pandemic. Next slide, please. We're stuck again. So instead of our prior, pre-COVID data release timeline, now going into 2021 and beyond, the charge of our team is to understand how we integrate COVID data into that data release timeline: how we can do it as quickly as possible, but also responsibly, understanding the quality of the data and integrating it properly into the Workbench platform. Next slide. So again, please visit researchallofus.org to apply to be a beta researcher, or pass this on to others in your organization. And next slide. This and the next slide are just an example of the many, many different community and provider partner networks around the country that contribute to the All of Us Research Consortium, and what a huge effort this has been. And I'm going to go on to the next slide. And I think the next slide is my wrap-up slide. Again, I've been so happy to share this with you all today and look forward to a discussion with David and Chris as well. Next slide, please. The larger group at our last in-person meeting; haven't had those in a while. There we go. And again, the next slide is a reminder of that researchallofus.org address. So thank you all for your attention, and I look forward to speaking more with you in the discussion time. Well, thanks, Dr. Ramirez. You really brought to life the real-world capabilities of the All of Us platform and how you were able to pivot in these COVID-19 times.
Thanks to the other speakers as well for their informative presentations. We're going to open up the session now for discussion, as well as your questions, with our panelists. And again, please submit your questions or comments through the question function of the webinar platform, and let me know if you'd like to direct them to a particular speaker. I'd really encourage everyone to take advantage of this brain trust that we have. I'm just going to kick it off with some of my own questions. And David, you know, your early work in organizing genomics big data seems to have really set the stage for approaches that have now been more extensible to other data types. Can you tell us more about how you think about bringing together disparate data types to make them useful to researchers? Yeah. I think the first thing I'll say is that, thinking about the differences and the commonalities between the different data types, and then the synergies of bringing them together: there are lots of mechanical differences. They're different sizes, they change at different rates, they need different tools and visualizations, across all the different kinds of biomedical data types. You know, Chris talked about the ones his teams are working with, and the different projects we talked about. I think genomics came first, and imaging was the other early one to really take advantage of new tech, just because they were big enough to break the old way. The old ways stopped working first for genomic data and imaging data. But the benefits of these new paradigms are not unique to genomic and imaging data; I think the benefits apply across all the data types. We just started with the ones where we were forced to start. On working across data types, I'll start with exactly what Chris said: you need the different data types together to get the value. The insights come from looking at these different data types together.
A few of the studies that Andrea showed were comparing different data types and cross-referencing them: binocular perspectives. You see better, you see in 3D. So technically, a lot of the work is the work that Chris described in depth, of curating things so that you can actually talk intelligently about these different data types from different places. Then providing a single analysis environment that gives you access to it all is the last part of the puzzle. Then, I think, it's: get out of the way. We have a tagline for Terra, which is "focus on your science." We consider it a success when someone doesn't talk to us about Terra. We consider it a success when people don't do presentations like mine; they do presentations like Andrea's. I don't want to hear people talk to me about the software. I want to hear people talk about what they've accomplished by using the software to get interesting things done with interesting data. Thanks, David. That leads me to a question for Chris, which is: you've been doing a lot of work already. I'm just wondering what aspects of the work that was already underway were most useful in accelerating the work on COVID-19? I think a lot of it culminated together. COVID-19, for us, is proving to be a far more complex one for most folks, because it's just so novel, so new; everyone's trying to figure out what to go after. A lot of our previous work has been around inching outside of just straight images, to say images plus labs plus some demographics, and trying to go from there. Luckily for us, I think it all fed into the Connect work we've been doing. Like I said, we started that about a year and a half ago, and a lot of that was based on seeing so much trouble getting this information to come up from the local side. We were pretty good at getting local images to come up to us to work with in the cloud, but everything else that went along with them was difficult.
We started building that about a year ago, and as we've tried to find ways to work with sites without installing Connect, it's proven to be very difficult, simply because then we're telling sites there's just a lot of manual work you have to do on your end, which is limiting. We kind of got lucky in the sense that we were already down this path, and I think COVID has just accelerated the focus in this particular area. Thanks, Chris. Andrea, I wanted to just address the elephant in the room, and that's EHR data, notoriously difficult to deal with. I'm just wondering how you've been approaching this in the All of Us platform. Is it purely structured data? Are you working with unstructured data? How are you ingesting all of it to make sense of it? That's a great question. I think many of the providers or patients who have been to the doctor over the last 20 years can think of the many different ways data was entered when they were a patient, and, as a provider, the many different ways, just in my decade or so of practice, we've put the same thing into different electronic systems. So bringing that all together and making sense of it for the researcher is probably one of the big challenges of this program and many others that are trying to do so. The core of All of Us's approach lies with the Observational Medical Outcomes Partnership (OMOP) common data model: bringing all of those different sources of data together and doing our best to organize them into a common data model, with a common nomenclature and a common structure, so you can go in and find at least where you think something should be. In our data science team's approach, we've learned to look once, look twice, look three times, because even with a common data model, over 30 different electronic health records coming into one model are still going to have things in different places.
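The OMOP-style normalization described here can be sketched as a source-to-standard lookup. The three source codes below are real-world cough codes (ICD-10-CM R05, ICD-9-CM 786.2, SNOMED CT 49727002), but the concept ID and the table itself are invented for illustration, not the actual OMOP vocabulary:

```python
# Toy source-to-standard vocabulary map; illustrative only.
SOURCE_TO_STANDARD = {
    ("ICD10CM", "R05"):      ("Cough", 1001),
    ("ICD9CM", "786.2"):     ("Cough", 1001),
    ("SNOMED", "49727002"):  ("Cough", 1001),
}

def to_standard_concept(vocabulary, code):
    """Normalize one site-local code to the shared standard concept,
    the way an OMOP-style ETL collapses many source vocabularies into
    one model; returns None for unmapped codes."""
    return SOURCE_TO_STANDARD.get((vocabulary, code))

# Three sites, three codings, one concept:
rows = [("ICD10CM", "R05"), ("ICD9CM", "786.2"), ("SNOMED", "49727002")]
concepts = {to_standard_concept(v, c) for v, c in rows}
assert concepts == {("Cough", 1001)}
```

The "look once, look twice, look three times" caveat is about everything this sketch hides: the same clinical fact can still land in different OMOP tables depending on how each source EHR recorded it.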
And so that's one thing that's great about the Workbench: we can take feedback all the time from researchers. In our alpha phases, our data science team gave a lot of feedback, and in the beta phase we're getting it as well. We hope to iteratively improve not only how things are stored in the data model, but also those support functions for where to go and what to look for. A big scientific question, and I think a great place for the Researcher Workbench to go as well, is methods development. One of the best things that could come out of making a platform like the Workbench and this dataset available is the answer to your question: are we finding what we're looking for? We've done very much just the tip of the iceberg so far on validating and understanding the data that we have. So we're really looking forward to the community diving into this dataset and helping us understand exactly what we have, so we can think about what five years, 10 years, 15 years from now looks like. So, great question, and an exciting opportunity for researchers now. I want to build on that. Yeah, go ahead, Dr. Burstin. I just wanted to build on that question, because I think it's the pivotal one for all the registry sessions we've had so far. One of the big issues tends to be that it seems like it would be easier to have a common data model when data are structured. I think the complication, certainly for something like COVID, is how much of the data may in fact be free text. So the example that came up a couple of webinars ago, across several of the clinical registries across specialty societies, is: how do you know that a cough is a cough is a cough, whether it's a surgery registry or a critical care medicine registry? Or how does Chris use that sort of clinical data as part of his work on trying to understand the images?
Any thoughts about how you move beyond data models around structured data, and what it takes to move us forward in terms of harmonizing the key elements for even those unstructured data, using COVID as the example? That sounds like a great question, actually, for all three of our panelists. Yeah, I apologize if we're going to go into the unstructured aspect later, but I'm happy to address it briefly. The next step All of Us is taking is extraction of concepts from free text. So we'll actually be reading the free text ourselves and extracting those concept labels back at the data model level to make available as structured data. How we get to free text, I think, is an open question that involves a lot of policy aspects as well, making sure that the privacy of participants is protected. When you look across other efforts, at times we actually can't do as good a job with a machine reading free text, and at times chart reading, chart review, and chart curation are necessary. The ultimate goal is to provide researchers the ability to go in and do their own chart review. Getting there is going to be a big leap, but it's within the next year's timeline for the program to figure out how to make that ability available. Chris, do you have thoughts on mining real-world data concepts like cough? Is it a significant cough? Is it a COVID-related cough? I think you're still on mute, Chris. I got it now. No, it's a difficult problem to deal with. I think we always have to go in with eyes wide open: there's not going to be a magic bullet that solves it. There's a variety of ways we're looking at solving it. One is just what everyone does: everyone throws NLP at it to try and extract it out of there. You're always going to have some known error rate to it and just have to accept some of that.
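The "accept a known error rate" point can be made concrete with a deliberately simple sketch of concept extraction from free text. Real pipelines use full clinical NLP toolkits; the terms and negation cues below are illustrative assumptions, not a clinical lexicon, and the simplicity is exactly where the error rate comes from.

```python
import re

# Negation cues that may precede a concept mention within the same sentence.
# This cue list is an invented, minimal example — real systems use far richer ones.
NEGATION_CUES = re.compile(r"\b(no|denies|without|negative for)\b[^.]*$", re.I)

def extract_cough(note: str) -> str:
    """Return 'present', 'negated', or 'absent' for the concept 'cough'."""
    for sentence in re.split(r"[.!?]", note):
        match = re.search(r"\bcough(ing)?\b", sentence, re.I)
        if not match:
            continue
        # A negation cue in the text preceding the mention flips the label.
        if NEGATION_CUES.search(sentence[: match.start()]):
            return "negated"
        return "present"
    return "absent"
```

Even this toy version shows why structured labels extracted from text need a validation pass: misspellings, abbreviations, and family-history mentions ("mother had chronic cough") would all be mishandled, which is the known error rate Chris says you have to budget for.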
We're also doing lots of pushes in other areas where you're just trying to move away from free text, or trying to do what we're seeing a lot of vendors do inside of their free text areas, which is discrete fields mixed with free text. But a lot of times what makes it even more challenging, I think, is that a cough is a cough is a cough at one location. What happens when you extract it from another location and it's one local variant of cough versus another versus another? So you get that problem too. I don't know if we're ever going to get around the fact that we'll have to do some sort of localized mapping across it. But I think there is a general movement: once we discover as a medical industry that this is an important aspect and why it's important, you start seeing movement towards actually making that a discrete part of the field. And David, I want to get your thoughts on this and build on a question from, and I'm probably going to mispronounce the name, Shikha Kotari, who asked: is there a general move toward a common data model for all repositories and registries? Well, I'll start with the question first. I'm reminded, as I often am on technical things, of a particular XKCD cartoon. This is the one that starts with a group of people getting together and saying, you know, the problem is we have 14 different standards for solving this problem. And then the second panel is: aha, we have an answer. There are now 15 different standards for solving this problem. So I am both optimistic about any work that people do to provide harmonized data models that are better, and pessimistic that there will ever be a single standard or model to rule them all.
So I think it's: keep working for harmony where possible, but don't block your success, our success as a field, on getting everyone to agree on the one true ontology or the one true coding or the one true terminology, because it's not going to happen. Zooming out to the larger question, I come at it by looking at what's happened in other domains. Maybe the easiest domain to look at is the web, and thinking about how people find information on the web. Some of you will remember that in the early days of the web, there was this company called Yahoo that had built a hand-curated, manually maintained, beautiful, rich, thoughtful hierarchy and directory of content on the web. It was a nice ontology of the web and a nice way to find information, and you could browse it in a catalog way. And then there was this other company, started out of Stanford, that said: let's embrace the messiness, take all the raw text as text, and see what we can do with it. In that case, it's very clear that as scale goes up, the answer is to find a way to embrace the messiness, because you can't keep up. So the question is, can we learn from that in the field of biomedicine? Where is that relevant and where isn't it relevant? Because it doesn't necessarily transfer. And I think there are two things to learn. One is about upstream labeling, and the other is about listening to the data. So I'll do them in reverse order. A lot of the advances in information architecture, information retrieval, information understanding, audio understanding, natural language understanding, meaning extraction, a lot of the advances there over the last decade have been through machine learning, as you all know. And the fuel for machine learning is sufficient training data, sufficient data to look at, from which you can start to extract signal from messiness, signal from noise.
So to the extent we want to embrace that approach in biomedicine, we are going to be dependent on the kinds of very large databases that programs like All of Us are building. And as we work through the policy issues that Andrea is describing, it will become very fruitful to say: all right, we've invested in the hand coding and we have the messy text; how can we use that to improve automated understanding? So that's one avenue, and I think any work we do to build bigger databases of the messiness is going to be how we learn to embrace it. The other avenue, and the other thing that has happened on the web to great success, is there's a whole world of webmaster tools, where the people who are producing the data have learned that if I put a couple of hints in my page, label what this is, and say, hey, this page is connected to that page, then search engines will listen to the hints. They can do better if you know upfront that there's going to be someone trying to analyze it downstream. And I think we have a lot of opportunities in biomedicine to say: what can we do at the time of data entry, at the time of data capture, to help the data be richer, knowing it's going to be used downstream in these aggregated, harmonized, long-term research projects? Even something as simple as automatically timestamping all text notes as they're entered, or transcribing things, right? Some of that happens, some of it doesn't. That's not sufficient, but it's an example of: if you knew a machine was going to help you make sense of it afterwards, what would you do upfront? And I think the content embedded in images is an example of getting this half right.
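David's "hints at capture time" idea can be sketched as a tiny capture-time envelope: wrap a free-text note in a small machine-readable record at entry, so downstream aggregation does not have to reconstruct provenance. The field names here are illustrative assumptions, not any standard.

```python
import json
from datetime import datetime, timezone

def capture_note(text: str, author: str, note_type: str) -> str:
    """Wrap a free-text note in a machine-readable envelope at entry time.

    The automatic timestamp and the upfront note-type label are the 'hints'
    a downstream harmonization pipeline would otherwise have to infer.
    """
    envelope = {
        "captured_at": datetime.now(timezone.utc).isoformat(),  # automatic timestamp
        "author": author,
        "note_type": note_type,  # labeled upfront instead of classified downstream
        "text": text,            # the messy free text, kept as-is
    }
    return json.dumps(envelope)

# A downstream consumer gets structure for free, plus the untouched text.
record = json.loads(capture_note("Pt with 3-day cough.", "dr_smith", "progress_note"))
```

The point is not the format, it is the asymmetry: a few cheap fields added at entry time save expensive inference later, exactly as page hints did for web search engines.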
Chris, I'm sure you've done an awful lot of work extracting text and numbers from images, some of which better not ever be there when you share the images, some of which is essential for you to know what to do with the images. Wouldn't it be nice if someone had just stuck that on the outside of the image and put it in a little record that said: hey, here's the data and here's the image. That's a small example. You and others are working around it, but I think we can do that. So those are the two approaches I'd take: get the data to help us learn from the messiness, and give us hints in the data, knowing what we're going to do with it downstream. Well, we appreciate your cautious optimism, David. I did want to share a humorous comment from Colleen Skow, a volunteer. She said, Dr. Jimmy Chang always says standards are like toothbrushes: everyone agrees they're necessary, but no one wants to use anyone else's. Back to our serious questions. This one is directed at Andrea, and it's around economic evaluation or health technology assessment of big data in programs like All of Us. Sure, I think there's a huge opportunity there. We have the beginnings of looking at that, and I'm not sure exactly which direction the question was going. Because we have such a diverse sample and we're recruiting not only from all over the country, but all across those underrepresented categories, we can begin to tease apart what economic stratification of data looks like: not only what we know from participants' income, but where they receive their care, and then looking at other things like overlaying geolocation with food availability or food deserts. I think this kind of dataset, if we can begin to link some of those outside data sources, is going to become even more rich.
So I think going in that direction, investigating some of those more socioeconomic aspects, will be very, very rich when we can start bringing in more outside data. In terms of health technology, the program in and of itself is an experiment in how technology can be used in research. The premise of this study was: enroll participants once, go get their electronic health record, and we'll make do with it going forward. We'll give everybody an email address, and we'll be able to ping them with snap questions and surveys and get that all back. Not surprisingly, but possibly more so than anticipated, economic disparities and rural-versus-urban disparities are really decreasing the response rates in outreach, engagement, and retention of participants versus those more represented, more technologically savvy participants. So that's a huge opportunity for the program, again, to contribute to what the role of technology should be in research going forward, as well as how you work with a dataset with missingness. We see it as an opportunity and a learning process that's happened in the program, and the science committee and the steering committee at both levels are really digging in on what the role of technology is in engaging participants, which I think is the bigger question there. And then a smaller part is engaging with digital health technologies: getting in Fitbit data, getting in Apple Watch data. Those things I think will become increasingly important and interesting, but again, in a small subset of participants for now, while we pilot those technologies, learn about them, and get them into the data model for reuse on the platform. Thanks, Andrea. Our final question is somewhat of a technical one, and maybe I'll direct it at Chris first and then the others can comment. This again is around data standards, from Sarah Tan.
And this is really: how are you all engaging with the problems of healthcare data standards, real-world data uses in EHRs versus life sciences using CDISC? The point here is that with the ONC and CMS mandate to bring ownership and engagement of data to the patient, there will be a much greater need to standardize real-world data for real-world evidence. Life sciences will be charged with bridging better care outcomes, and HL7 is trying to help bridge the divide. Just wondering how you are grappling with these data model differences? Yeah, so one thing I'll throw out there right away: the hard part with healthcare standards is, say you can speak HL7. Does that mean you can speak to anyone else who speaks HL7? Probably not. It's a standard in the sense that there's a wide variety of ways you can package things up within it. So, a couple of approaches. One, we're directly engaging with the standards bodies, the SDOs, HL7 and DICOM. We have a director of interoperability here at the ACR helping to advance and push forward a lot of the different proposals and standards. Two, at least in our domain of radiology, we're pressing a lot on discrete elements. One of the areas of medicine that probably has the best interoperability is labs, and part of that is LOINC. Because a lab's a lab and it has a LOINC code behind it, you can transfer it across; it's very discrete. Versus other areas, where it's more of: well, it's a big blob of text and you go figure it out on your side. So one of the things we're doing there is a joint effort with RSNA and many of the subspecialty societies called RadElement, to help discretely define findings and details inside of radiology.
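The labs example Chris gives — a result travels well because a LOINC code rides along with it — separates two layers: packaging (the resource that carries the value) and encoding (the code that says what the value means). A minimal sketch of that split, using a FHIR-shaped Observation expressed as a plain Python dict; 94500-6 is the LOINC code commonly used for SARS-CoV-2 RNA PCR results, but treat the rest of the resource as an illustrative skeleton rather than a validated FHIR instance.

```python
# Packaging layer: a FHIR-shaped Observation resource (illustrative skeleton).
observation = {
    "resourceType": "Observation",
    "status": "final",
    "code": {  # encoding layer: what was measured, in a shared vocabulary
        "coding": [{
            "system": "http://loinc.org",
            "code": "94500-6",
            "display": "SARS-CoV-2 RNA Resp Ql NAA+probe",
        }]
    },
    "valueCodeableConcept": {"text": "Detected"},
    "subject": {"reference": "Patient/example"},
}

def loinc_codes(resource: dict) -> list:
    """Pull the LOINC codes out of a resource, whatever else it packages."""
    return [c["code"]
            for c in resource.get("code", {}).get("coding", [])
            if c.get("system") == "http://loinc.org"]
```

A receiver that only understands the encoding layer can still answer "which COVID PCR results do I have?" without caring how the surrounding resource was packaged — which is why labs interoperate better than free-text-heavy domains.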
Going a bit beyond that, what we're also doing, primarily on the RadElement aspect, is putting up profiles of: here's how we would expect someone to communicate, say, a radiology CDE finding in FHIR, in HL7 V2, in DICOM, DICOM SR, DICOM Web. That way we can preemptively get out ahead of it before a bunch of vendors do stuff that's just their own variation of it. And the last part of it, what maybe puts us in a semi-unique position, is that we're inside of all this. We've been collecting data at the sites now for decades, so we've talked basically every flavor of DICOM you possibly can, and we're getting involved with HL7 and FHIR a lot more. By being inside the game of it and having those exchanges with vendors, you can start influencing them, by being there onsite and being something they want to do, and you can start pressing on things that way to make things better. But it'll always be a long road to get there. And part of it is that when people say standards, sometimes we mean: can you communicate? Can you talk FHIR, HL7? Can you package it the right way? And then there's another aspect: can you encode the information, like a LOINC code or SNOMED code or, for us, a RadElement code, correctly? That's less on the SDOs and more on the specialty societies. And the other last part is just: will vendors put implementations together that actually work together? There are a lot of different moving parts behind it. One other aspect I'll throw out there, just in terms of the short term: one of the things we've had great success with, especially in research-type deals, is recognizing that a lot of healthcare standards are very much made around operations. How do you just manage your clinic? How do you manage your surgery area?
Which tends to be a very different problem set from how you extract a bunch of data and search through it in different ways. We've been finding that working with vendors on essentially somewhat custom interfaces, but not that custom, works well. For things like MIDRC, when we're talking to a lot of vendors, we say: here are the data elements we want. Can you extract this from your system in a CSV file or JSON? Something relatively simple that can give us enough, so we're not telling a user to go scrub through your system manually and pull it out. And we're seeing a lot of success there. Just simple things like that can get us moving pretty fast. Well, thanks, Chris, for those comments, and thanks to all of our speakers for their valuable perspectives. I'll turn it back over to Helen. Turned off my picture, but trying to unmute myself. Hi, everybody. Thanks again. That was really extraordinary. I've been trying to keep up with it, trying to put some on Twitter. And I did find that XKCD cartoon, David, and tweeted it. It's just perfect, so everybody can see it. Again, thank you all. Thank you, Bill. Mark's apologies; we couldn't see him, we couldn't get his camera working, but obviously he was a powerful moderator, even in voice only. So thank you again, as you'll see here. The recording of this webinar is available; feel free to share it with others. And you'll get a short evaluation as well to follow. Again, special thanks to all of our panelists, as well as support from the Gordon and Betty Moore Foundation and our partners at the AAMC. Next slide. And just in the final moment, just to let you know, we've got two more of these webinars to go. And thanks to Bill for also being on our planning committee.
August 12th: Prioritizing Patient Engagement and Inclusion of Patient-Generated Data, which I think will be interesting for many of us who were talking about this work today, led by Susannah Fox and a group of really remarkable patients who will bring their perspective on what is doable by prioritizing and including patient-generated data and the patient voice, with some focus, for example, on long-term implications of COVID. Dr. Ramirez specifically mentioned, for example, the longitudinal survey around patient wellbeing, questions like that from a patient's perspective. And then finally, we'll have our last webinar in this series on September 1st: how do we use clinical registries to address disparities in COVID-19, something we didn't talk a lot about today, but hope to bring home as part of this effort. So we're almost at 1:30. Again, thank you all so much for this remarkable effort here, and to the speakers for our next session on August 12th. I think we've learned a lot, and there's a long way to go. Thank you all. Take care. Bye-bye. Bye.
Video Summary
The Council of Medical Specialty Societies (CMSS), led by Helen Burstin, has been running a series of webinars in collaboration with the AAMC, supported by the Gordon and Betty Moore Foundation. These webinars are focused on advancing clinical registries to support pandemic treatment and response, with particular emphasis on leveraging data through rapid cycle learning and development amidst the COVID-19 pandemic. This specific session, the fourth in the series, concentrated on deploying cloud-based platforms and analytic tools to aid COVID-19 efforts.

Experts David Glazer from Verily Life Sciences, Chris Tremel from the American College of Radiology, and Dr. Andrea Ramirez from Vanderbilt University shared their insights. David Glazer highlighted the capabilities of large-scale, cloud-based data platforms such as Terra, emphasizing collaborative workspaces and the integration of diverse data modalities. He also discussed the NIH All of Us Research Program, showcasing its scalable, secure platform for biomedical analysis.

Chris Tremel detailed the federated model used in the MIDRC project, a virtual registry involving multiple organizations that focus on COVID-19 imaging and research. He described the complexities and governance required for multi-entity collaborations and introduced ACR Connect, a platform aiding the integration and anonymization of diverse data types from healthcare systems.

Dr. Andrea Ramirez presented the efforts of the All of Us Research Program in gathering data from over 224,000 participants, with a focus on underrepresented groups. She highlighted their approach to handling and validating various data types, including survey data, physical measurements, and electronic health records.
She also discussed the program's rapid pivot to include COVID-19 relevant data, such as serology studies and rapid EHR updates.

Together, these speakers underlined the transformative potential of cloud-based data platforms and federated registries, emphasizing collaborative approaches to integrate and analyze diverse data types effectively.
Asset Caption
This webinar will provide perspectives on:
The promises of multi-dimensional data platforms
What analytic tools and techniques are available within and outside of healthcare that can be leveraged
How the lessons learned from COVID-19 can be applied more broadly and to the next generation of clinical registries
Keywords
CMSS
Helen Burstin
webinars
AAMC
Gordon and Betty Moore Foundation
COVID-19
cloud-based platforms
David Glazer
Terra
MIDRC
All of Us Research Program
federated registries