Leveraging open data for Risk Scoring of medical practitioners

This post discusses how we used publicly available information to derive a risk score for medical practitioners in New Jersey. Before we go any further, let's define what we mean by Risk Score.

Malpractice refers to negligence or misconduct by a professional, such as a lawyer or a doctor. Among physicians, malpractice is any bad, unskilled, or negligent treatment that injures the patient. Our objective was to 1) develop a predictive model that scores medical practitioners in New Jersey for “riskiness to medical malpractice”, referred to here as the Risk Score, and 2) understand the potential drivers of that risk.

Quantifying potential risk is one of the most complex management challenges that insurance companies and policy makers face when providing finances to medical practitioners. Using public data is a step towards reducing the uncertainty in this area: we combine data features aggregated across multiple open sources with analytics and predictive modeling to identify potential drivers of risk.

Based on the publicly available data features, we defined malpractice as any of the following:

  • Disciplinary actions
  • Sanctions/Board actions
  • Debarments/Disqualifications
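
The definition above amounts to a binary target: a practitioner is flagged if any of the three event types appears in the public record. A minimal sketch of that labelling step follows. The field names (`disciplinary_actions`, `board_sanctions`, `debarments`) and the sample records are hypothetical, since the actual schema used in the analysis is not described in this post.

```python
from dataclasses import dataclass


@dataclass
class PractitionerRecord:
    # Hypothetical fields: counts of public events found for one practitioner.
    practitioner_id: str
    disciplinary_actions: int = 0
    board_sanctions: int = 0
    debarments: int = 0


def has_malpractice_flag(record: PractitionerRecord) -> bool:
    """Return True if any of the three event types is present."""
    return (
        record.disciplinary_actions > 0
        or record.board_sanctions > 0
        or record.debarments > 0
    )


# Made-up records, for illustration only.
records = [
    PractitionerRecord("A"),
    PractitionerRecord("B", disciplinary_actions=1),
    PractitionerRecord("C", debarments=2),
]

# 1 = at least one event on record, 0 = none.
labels = {r.practitioner_id: int(has_malpractice_flag(r)) for r in records}
```

Collapsing the three event types into one flag keeps the target simple; a variant could keep them separate if the drivers of each were of interest.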

In identifying the data sources, we ensured that the sources used were credible and persistent. Some of the sources we looked into were:

  • CMS Open Payments
  • New Jersey Consumer Affairs
  • Office of Inspector General (OIG)
  • National Practitioner Data Bank (NPDB)

Various other sources were used to extract information about medical practitioners in NJ, and potential risk drivers, such as:

  • Medicare & Medicaid
  • County Health Rankings
  • ClinicalTrials.gov
  • NIH.gov
  • FDA

In addition, associated hospital and epidemiological characteristics, such as affiliated hospital rating, average healthcare cost, readmission rate, and percentage of uninsured population, were collected for the given zip codes.
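
Attaching area-level characteristics to each practitioner can be thought of as a lookup keyed on zip code. The sketch below illustrates the idea with plain dictionaries. The column names, zip codes, and values are invented for illustration and do not come from the sources listed above.

```python
# Hypothetical practitioner rows; only the zip code matters for the join.
practitioners = [
    {"practitioner_id": "A", "zipcode": "07001"},
    {"practitioner_id": "B", "zipcode": "07002"},
    {"practitioner_id": "C", "zipcode": "07999"},  # no area data available
]

# Hypothetical zip-level features (values are made up).
zip_features = {
    "07001": {"readmission_rate": 0.15, "pct_uninsured": 0.10},
    "07002": {"readmission_rate": 0.18, "pct_uninsured": 0.12},
}

FEATURE_NAMES = ("readmission_rate", "pct_uninsured")


def attach_zip_features(rows, features_by_zip):
    """Left-join zip-level features onto practitioner rows.

    Practitioners whose zip code has no entry keep their row,
    with None marking each missing feature.
    """
    enriched = []
    for row in rows:
        area = features_by_zip.get(row["zipcode"], {})
        merged = dict(row)  # copy so the input rows are not mutated
        for name in FEATURE_NAMES:
            merged[name] = area.get(name)
        enriched.append(merged)
    return enriched


enriched = attach_zip_features(practitioners, zip_features)
```

Keeping unmatched practitioners with explicit missing values, rather than dropping them, makes it possible to measure later how much of the population actually has usable data.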

The data flow diagram below shows the steps taken to ensure that the data collected was adequate, came from credible sources, and was rigorously quality-checked and normalized before any analysis was performed.

In the end, we had approximately 70% coverage, as shown below:

With the given public information, we were able to achieve 65% accuracy.
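
For readers unfamiliar with the metric, accuracy is the share of practitioners whose predicted label matches the observed one. The sketch below shows the arithmetic on made-up values. The 0.5 cut-off for turning a score into a label is an assumption, as the post does not state the threshold or the model type used.

```python
def accuracy(y_true, y_score, threshold=0.5):
    """Fraction of cases where the thresholded score equals the true label.

    `threshold` is an assumed cut-off, not one taken from the analysis.
    """
    if len(y_true) != len(y_score):
        raise ValueError("y_true and y_score must have the same length")
    if not y_true:
        raise ValueError("at least one observation is required")
    y_pred = [1 if score >= threshold else 0 for score in y_score]
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Made-up labels and risk scores for four practitioners.
y_true = [1, 0, 1, 0]
y_score = [0.8, 0.3, 0.4, 0.6]
print(accuracy(y_true, y_score))  # 0.5 (two of four predictions match)
```

When flagged practitioners are a small share of the population, accuracy is best read alongside the base rate, since a model that never flags anyone can also score highly.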

Some of the potential drivers identified in the analysis are shown below.

While this analysis doesn’t give a foolproof method or score for risk, the risk features identified, together with the scores and potential drivers, can greatly help providers make more informed decisions. The model we built using external sources is only the beginning. Our team is working on further enhancements, including adding paid and syndicated sources and expanding coverage to US states other than New Jersey.

Screenshots of the Dashboard (for Illustrative Purposes Only)

Some of our earlier blog posts about the use of open/public data in different verticals:

Learn more about CrowdANALYTIX solutions here.



Mohan 1 year ago

It's made using d3.js, HTML5 and other JS web technologies. Thanks.


Guest 1 year ago

Hi CAX Mohan S,

The blog is informative, but I wanted to know which tool you used to create the charts.