In this article we look at how Neo4j can be used to analyze healthcare provider networks.
Some domains in healthcare are dynamic and require constant changes to the data model. Insurance products and provider networks are good examples. Insurance products have properties that are updated from year to year: groups are added or removed, and their properties are modified over time. Provider networks also change over time, with providers being added or removed and in/out-of-network statuses changing.
For an organization analyzing this information, rigid database schemas slow down the speed at which useful analytics products can be created.
One way to accommodate this dynamic nature of the data model while still being able to analyze the data quickly is to use a graph database. Neo4j is one that comes to mind.
The Project
As an example, we can use an openly available data set of some of New York’s providers that can be found here. We will parse this data into a staging database (one column holds the dynamic fields). We will then load the data into a Neo4j graph.
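As an illustration of the staging step, here is a minimal Python sketch that keeps a couple of fixed identifying columns and folds everything else into a single JSON column. The table, file, and column names here are assumptions for the example, not the project's actual schema.

```python
import csv
import json
import sqlite3

# Staging table: fixed identifying columns plus one JSON column
# ("dynamic_fields") that holds everything else, so new columns in the
# source file do not require schema changes.
conn = sqlite3.connect("staging.db")
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS staging_provider (
        provider_id TEXT,
        provider_name TEXT,
        dynamic_fields TEXT  -- JSON blob of all remaining columns
    )
    """
)

FIXED_COLUMNS = {"provider_id", "provider_name"}  # assumed column names

with open("ny_providers.csv", newline="") as f:  # assumed file name
    for row in csv.DictReader(f):
        extras = {k: v for k, v in row.items() if k not in FIXED_COLUMNS}
        conn.execute(
            "INSERT INTO staging_provider VALUES (?, ?, ?)",
            (row.get("provider_id"), row.get("provider_name"), json.dumps(extras)),
        )

conn.commit()
conn.close()
```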
In this data set, we will visualize the following domains:
Health Providers
States/Counties
Health Insurance Plan Names
Network Indicators – These show different groupings that each plan & provider are a part of.
We will first need to transform the data a little, including some unpivoting to dynamically convert all the indicator and network columns to rows. We will use dbt to do this. I am a strong advocate for SQL-based tools like dbt that can be used by the majority of data professionals. The source code can be found here.
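The project itself performs this unpivot as a dbt model (see the source code link above). As a rough illustration of the same idea, here is a pandas sketch; the column names are made up for the example and are not the real dataset schema.

```python
import pandas as pd

# Illustrative only: the project does this unpivot in dbt. The columns
# "provider_id", "plan_name", and the indicator columns are assumptions.
df = pd.DataFrame(
    {
        "provider_id": ["P1", "P2"],
        "plan_name": ["Plan A", "Plan B"],
        "network_indicator_hmo": ["Y", "N"],
        "network_indicator_ppo": ["N", "Y"],
    }
)

# Unpivot: every indicator/network column becomes a (name, value) row,
# so new indicator columns in future files need no model changes.
long_df = df.melt(
    id_vars=["provider_id", "plan_name"],
    var_name="indicator_name",
    value_name="indicator_value",
)
print(long_df)
```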
Then we will load the unpivoted data into the graph, adding new nodes and edges (relationships) as we come across them. (See the Neo4j Cypher queries used in the GitHub link.)
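The actual Cypher is in the GitHub repository; the following is only a minimal sketch of the MERGE-style loading pattern using the official Neo4j Python driver, with assumed labels, property names, and relationship types.

```python
from neo4j import GraphDatabase

# Sketch only: labels, property names, and the relationship type are
# assumptions; the real queries are in the linked GitHub repository.
LOAD_ROW = """
MERGE (p:Provider {provider_id: $provider_id})
MERGE (plan:Plan {name: $plan_name})
MERGE (p)-[r:PARTICIPATES_IN {indicator: $indicator_name}]->(plan)
SET r.value = $indicator_value
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

rows = [
    {"provider_id": "P1", "plan_name": "Plan A",
     "indicator_name": "network_indicator_hmo", "indicator_value": "Y"},
]

with driver.session() as session:
    for row in rows:
        # MERGE creates nodes/relationships only when they do not already
        # exist, so the load adds new nodes and edges as they are encountered.
        session.run(LOAD_ROW, **row)

driver.close()
```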
Creating stable data pipelines in healthcare can be quite challenging for a number of reasons. In the rest of this article we will review some of these challenges and propose an opinionated healthcare integration pipeline that can help mitigate them.
It is understandable that every healthcare integration project has its own niche and market. This makes it difficult to find a tried and tested methodology for creating a pipeline. However, it is still important to stick to some principles that are applicable to most situations. One such principle is to focus on standardizing the different sources of data so that a single pipeline path (or just a few) can be used.
Duplication / Difficult Reuse
Without proper planning, data pipelines can degenerate into a mess of customizations that are difficult to reuse.
A good principle for preventing this is to apply changes at a few defined levels, consistently. For example, when calculating patient risk scores, first apply general scoring logic (reusable across all sources) that is not source-specific, then, if needed, apply source-specific scoring that can override the general one. These levels should be maintained separately so they can evolve independently.
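A hypothetical sketch of this layering in Python; the scoring rules, field names, and source names are invented purely for illustration.

```python
# Hypothetical sketch of layered scoring: a general scorer that applies to
# every source, plus optional source-specific overrides kept separately.

def general_risk_score(patient: dict) -> float:
    # Generic logic, reusable across all sources.
    score = 0.0
    if patient.get("age", 0) >= 65:
        score += 1.0
    score += 0.5 * len(patient.get("chronic_conditions", []))
    return score

def source_a_risk_score(patient: dict, base_score: float) -> float:
    # Source-specific adjustment that can override the general result.
    if patient.get("readmitted_last_30_days"):
        return base_score + 2.0
    return base_score

# Overrides are looked up per source, so each level evolves independently.
SOURCE_OVERRIDES = {"source_a": source_a_risk_score}

def risk_score(patient: dict, source: str) -> float:
    score = general_risk_score(patient)
    override = SOURCE_OVERRIDES.get(source)
    return override(patient, score) if override else score

print(risk_score({"age": 70, "chronic_conditions": ["copd"]}, "source_a"))
```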
Reuse can also be enhanced by using configuration, ideally stored in a database. Pipeline stages can query the configuration to find out which components of logic to apply to the data. This reduces the pipeline changes needed with every new data source added in the future – the pipeline stays constant while the configuration evolves.
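A minimal sketch of this idea, using an in-memory SQLite table to stand in for the configuration database; the table, column, and transform names are all assumptions.

```python
import sqlite3

# Sketch: configuration lives in a database table; a pipeline stage reads it
# at run time to decide which transform components to apply for each source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_config (source TEXT, transform TEXT, ordering INTEGER)")
conn.executemany(
    "INSERT INTO source_config VALUES (?, ?, ?)",
    [("source_a", "normalize_names", 1), ("source_a", "map_codes", 2)],
)

# Registry of reusable logic components (names are illustrative).
TRANSFORMS = {
    "normalize_names": lambda rec: {**rec, "name": rec["name"].title()},
    "map_codes": lambda rec: {**rec, "code": rec["code"].upper()},
}

def run_stage(record: dict, source: str) -> dict:
    rows = conn.execute(
        "SELECT transform FROM source_config WHERE source = ? ORDER BY ordering",
        (source,),
    ).fetchall()
    # Adding a new source means adding config rows, not changing the pipeline.
    for (name,) in rows:
        record = TRANSFORMS[name](record)
    return record

print(run_stage({"name": "jane doe", "code": "abc"}, "source_a"))
```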
Scalability Issues
A solution to scalability problems is to use an event-driven approach. Each step picks up its task from a messaging queue and also sends its results to a queue. Apache Kafka is a good option for this type of setup.
Message queues are better suited to streaming data sources (say HL7 v2 or FHIR), but they can also be used for batch loads if only the job metadata, and not the message itself, is put on the queue.
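As a rough sketch of the batch variant, assuming the kafka-python client and a locally running broker; the topic name and job-metadata fields are illustrative.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# For batch loads, only job metadata (file location, row counts, etc.) goes
# on the queue -- not the file contents themselves.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(
    "batch-load-jobs",  # assumed topic name
    {"source": "claims_extract", "path": "/landing/claims/2024-01.csv", "rows": 125000},
)
producer.flush()

consumer = KafkaConsumer(
    "batch-load-jobs",
    bootstrap_servers="localhost:9092",
    group_id="loader-workers",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    job = message.value
    # A worker picks up the job metadata and fetches/loads the file itself.
    print(f"Loading {job['path']} ({job['rows']} rows) from {job['source']}")
    break
```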
Using the configuration-driven approach mentioned above can also help avoid the ‘pipeline per data source’ anti-pattern.
Moving the data warehouse to distributed cloud solutions such as Amazon Redshift should also help with scale.
To public cloud or not?
Healthcare data is very sensitive (see HIPAA), but concerns about using public cloud providers in healthcare are diminishing with time.
Regardless, architecting pipelines as micro-services that can be deployed in-house, in a private cloud, or in a public cloud without many changes should help with this.
Monitoring & Alerting
Pipeline metrics and logging are crucial, especially when using horizontally scalable solutions. It is important to be able to view, say, the counts of records processed at each stage and processing node, as well as any error logs, from a centralized dashboard. Without this, it is difficult to get notified when something goes wrong.
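A minimal sketch of per-stage metrics using prometheus_client, assuming a Prometheus-compatible dashboard scrapes the exposed endpoint; the metric and label names are illustrative.

```python
from prometheus_client import Counter, start_http_server

# Counters labelled by stage and processing node, so a central dashboard can
# show per-stage/per-node throughput and error counts.
RECORDS_PROCESSED = Counter(
    "pipeline_records_processed_total",
    "Records processed, by stage and node",
    ["stage", "node"],
)
RECORDS_FAILED = Counter(
    "pipeline_records_failed_total",
    "Records that raised an error, by stage and node",
    ["stage", "node"],
)

def process(record: dict, stage: str, node: str) -> None:
    try:
        # ... actual stage logic would go here ...
        RECORDS_PROCESSED.labels(stage=stage, node=node).inc()
    except Exception:
        RECORDS_FAILED.labels(stage=stage, node=node).inc()
        raise

if __name__ == "__main__":
    start_http_server(8000)  # expose /metrics for the monitoring system
    process({"id": 1}, stage="ingest", node="worker-1")
```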
Skills Shortage
It is difficult to find enough expertise in all the technologies needed to run a modern data pipeline. To reduce the need for specialized knowledge, SQL-based tools (see dbt) can help.