Digital Health CRC data scientist Dr Amir Marashi spent over a year getting familiar with an extraordinary dataset supplied by our core participant HMS Healthcare – and says that there’s enough information in the data to deliver a lifetime of research findings.
The dataset comprises hundreds of millions of limited paid Medicaid claims records from 3.9 million different health providers, and over fifty million patients, across ten US states, since 2015.
“It’s an enormous dataset, with so many different features and tables linked together, and we receive an update every month,” says Amir.
He says that COVID has changed the number of claims received for different medications, and the use of telehealth has increased dramatically since March 2020.
“There are so many attributes to this dataset, and observing the way the data has changed in 2020 has been fascinating,” he says.
A number of DHCRC projects use the data for their research, and rely on Amir’s expertise to get the most from the data.
Real-world experience
Amir says that the 70-plus people taking part in this month’s DHCRC, HMS and RONIN Telehealth Datathon will find the dataset challenging to work with – but they will also get some extremely valuable experience working with real-world data.
He says that there are three key learning aspects for our datathon participants.
1. Challenging dataset to work with
The dataset has many unique features, including lots of interlinked tables.
“As with most large real-world datasets, there’s some pre-processing and data cleaning that you must do in order to work with the data,” he says.
Amir must spread his expertise over 13 datathon teams, so he acknowledges this week will be ‘crazy busy.’
2. Huge dataset size
Because the dataset is so enormous, a simple task that would normally take a second or two on an average dataset will take much longer – so teams will need to learn to plan their work carefully.
“Working with a dataset of this size requires some specific software and techniques which the users will have to learn,” he says.
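As a rough illustration of the kind of technique involved, the sketch below streams records one row at a time and keeps only a running tally, rather than loading a whole table into memory. It is a minimal example under stated assumptions, not the method used by Amir or the datathon teams: the column names `service_date` and `is_telehealth`, the function name and the sample rows are all hypothetical and are not taken from the HMS dataset.

```python
import csv
import io
from collections import Counter


def telehealth_claims_by_month(rows):
    """Tally telehealth claims per month while reading one row at a time.

    `rows` is any iterable of dicts, such as a csv.DictReader, so the full
    file never has to fit in memory. Column names here are hypothetical.
    """
    counts = Counter()
    for row in rows:
        if row["is_telehealth"] == "1":
            month = row["service_date"][:7]  # keep the "YYYY-MM" prefix
            counts[month] += 1
    return counts


# Tiny in-memory stand-in for a very large claims file (made-up values).
sample = io.StringIO(
    "claim_id,service_date,is_telehealth\n"
    "1,2020-02-11,0\n"
    "2,2020-03-05,1\n"
    "3,2020-03-19,1\n"
    "4,2020-04-02,1\n"
)

monthly = telehealth_claims_by_month(csv.DictReader(sample))
for month, count in sorted(monthly.items()):
    print(month, count)
```

The same pattern applies when the reader is pointed at a real file instead of the in-memory sample: memory use depends on the number of distinct months, not on the number of claims.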
3. Entirely cloud-based
The dataset is not public and can only be accessed under licence and following a set of approvals. “The data cannot be distributed on personal computers,” Amir explains – he says it is true cloud computing.
“Most of our researchers would be used to extracting data to work on at their own computers using something like Excel – this is not possible with this data.”
“Users must all learn to work in a virtual space, in this case AWS.” AWS – Amazon Web Services – is the world’s largest on-demand cloud platform.
This is where datathon project partner RONIN comes in, he says. RONIN’s dashboard is designed to help university researchers navigate the use of AWS cloud computing and storage without facing a steep learning curve.
Diversity will bring new perspectives
Amir says that although he’s spent over a year getting familiar with the data, he expects our datathon participants will uncover some great new aspects of it.
“Our thirteen different teams each have health topic specialists as well as their own data scientists,” he explains.
Amir says that when the health data is explored by someone with ‘domain knowledge’ from a particular specialty, they will think of ways to link information that could uncover some really useful results.
“All our teams have access to the whole dataset – and we anticipate they will come up with ideas we have never thought of before, and bring some very interesting outcomes because they have so many different backgrounds and views,” he says.
“When you get such diverse groups of people looking at data, the results can be surprising.”