Using FHIR with Google Cloud, BigQuery, and the AI Platform

Peter Lubell-Doughtie
January 30, 2020

From our XLSForm definitions in Ona Data to our JSON Forms in OpenSRP, the work we do at Ona is built upon community driven standards. As we further integrate HL7 FHIR (Fast Health Interoperability Resources) into OpenSRP, we’ve been exploring the existing tooling available. In this post we’ll dive into the Google Cloud Platform (GCP) Healthcare offering and run through an example where we:

  • Generate representative patient data in the STU3 FHIR format,
  • Load that patient data into a FHIR store,
  • Observe data in the FHIR store using BigQuery,
  • Visualize and analyze those queries in the AI Platform.
  • Export charts and machine learning models.

Our GLOVE approach to healthcare analytics transfers to other platforms and toolsets as the situation requires, including on-premise hardware. GCP lets us quickly set up a workflow to prototype new ways of exploring healthcare data. OpenSRP lets us ensure that, whether gathered at a household or in a facility, the data we are collecting has a well defined and FHIR compatible data structure.

Demo requirements

To follow along with the demonstration code below we will need the

  • Google Cloud SDK, and a
  • Google Cloud Platform account (the free trial will do).

We can use the link above to install the Google Cloud SDK and then run

gcloud beta healthcare

to install the Healthcare beta package.

Install and update Synthea

Synthea is a synthetic patient data generator program capable of generating health data for specific patient populations in various FHIR formats. Below we’ll clone the Synthea repository to our local machine and build it.

git clone
pushd synthea
./gradlew build check test

Enable the STU3 FHIR format, disable other FHIR formats

Here we adjust the Synthea configuration to generate data in the FHIR STU3 format. We can also use the DSTU2 format, but we must choose one or the other for this example.

vi src/main/resources/

Using the text editor of your choice, let’s open the file and adjust the content as below.

exporter.fhir.export = false
exporter.fhir_stu3.export = true
exporter.fhir_dstu2.export = false = false = true = false

exporter.practitioner.fhir.export = false
exporter.practitioner.fhir_stu3.export = true
exporter.practitioner.fhir_dstu2.export = false

Generate health records in Synthea

Now that we have installed and configured Synthea, we will use its commands to generate data. The Synthea command line can generate location agnostic patient records, or alternatively accept a State and or City located in the United States, to generate location specific patient records. Below we pass a location and the -p option to generate ten patients representative of those seen in Eufaula, Alabama.

./run_synthea -p 10 Alabama Eufaula

Set our variables

Below we’ll set a number of variables that we refer to in the later commands run against the GCP SDK. Either create a new project or use an existing project, then read the ID and number from the project dashboard. In LOCATION make sure to choose a region that supports the services Healthcare, BigQuery, and AI Platform. Below is a breakdown of GCP feature support per region at the time of publication, January 2020.

Healthcare BigQuery AI Platform

These are the variables to set in the terminal for the remaining commands.

PROJECT_ID=[our project ID]
PROJECT_NUMBER=[our project number]
BUCKET=[a bucket in GCP]
FOLDER=[a folder in the above bucket, doesn't need to exist]
LOCATION=[a GCP region]

Feel free to adjust the names of the FHIR_STORE and DATASET_ID as desired.

Grant permissions

Here we grant permissions to the Cloud Healthcare Service Agent on the storage and BigQuery roles that it will need to access.

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:service-$ \

gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member=serviceAccount:service-$ \

Create healthcare dataset and FHIR store

First we create a healthcare dataset. The healthcare dataset exists in a specific region and can contain multiple different datastores of different formats within it. We can also use the online console to view, create, or modify healthcare datasets.

gcloud beta healthcare datasets create $DATASET_ID --location=$LOCATION

With the below command, we create a FHIR store within our healthcare dataset.

gcloud beta healthcare fhir-stores create $FHIR_STORE \
  --dataset=$DATASET_ID --location=$LOCATION

Upload synthetic FHIR records to a bucket

Before we send our records into the FHIR store we have to push them to a bucket in cloud storage. The gsutil command below sends all the records we generated through Synthea to our cloud storage bucket.

gsutil -m cp output/fhir_stu3/* gs://$BUCKET/$FOLDER

Move records from a bucket into a FHIR store

Now that we have our records remotely available in a cloud storage bucket, we use the following command to transfer them from that bucket to our FHIR store.

gcloud beta healthcare fhir-stores import gcs $FHIR_STORE \
  --dataset=$DATASET_ID \
  --location=$LOCATION \
  --gcs-uri="gs://$BUCKET/$FOLDER/**.json" \

This command will block until complete and end by returning the number of successfully transferred records.

Create a BigQuery dataset

Here we create an empty dataset in BigQuery that we’ll later transfer our FHIR data to.

bq mk --location=$LOCATION --dataset $PROJECT_ID:$DATASET_ID

Export from a FHIR store to BigQuery

Now we’ll use the following command to transfer the FHIR records in our FHIR store into the new dataset we’ve created in BigQuery. This command will also block, but we can pass the --async option if we want to run it in the background.

gcloud beta healthcare fhir-stores export bq $FHIR_STORE \
  --dataset=$DATASET_ID --location=$LOCATION \
  --bq-dataset=bq://$PROJECT_ID.$DATASET_ID \

After transferring our data from our FHIR store to our BigQuery dataset, we can go to the BigQuery console to see the tables we created, preview the schemas and contents of the tables, as well as run SQL queries against the tables.

Explore BigQuery data with AI Platform Notebooks

For a more flexible, powerful, and persistent approach to exploring the BigQuery FHIR tables, we can use Notebooks in the AI Platform to make queries directly against our BigQuery datasets.
If we launch a Python 3 Jupyter Lab notebook, we can use the following code to run a query directly against BigQuery:

from import bigquery
import pandas as pd

project_id = "[PROJECT ID]"
fhir_store = "[FHIR_STORE]"

client = bigquery.Client()

    SELECT code.coding[safe_offset(0)].display As Condition,
           DATE_DIFF(CURRENT_DATE, PARSE_DATE('%Y-%m-%d', birthDate), YEAR) as Age
    FROM `{project_id}.{fhir_store}.Condition` AS Condition
    JOIN `{project_id}.{fhir_store}.Patient` AS Patient
""".format(project_id=project_id, fhir_store=fhir_store)

In the above code, update project_id and fhir_store to match what we used in our command line variables. This query will join all the code.coding.display values from the Condition table with patient ages from the Patient table, as calculated using the birthDate values and the current date.

Notice that [safe_offset(0)] is required to handle the nested column names containing periods (.) that result from the manner in which BigQuery converts nested FHIR resources into table definitions. Using [safe_offset(0)] is also valuable when running SQL queries directly against BigQuery in the console.

In the code below, we execute this query against our BigQuery client and convert the results into a Pandas dataframe. Then we take that dataframe and generate four age bins, equally split by frequency, from our Age column.

df = client.query(condition_age).to_dataframe()
df['AgeBins'] = pd.cut(df.Age, 4)

df.groupby(['Condition', 'AgeBins'])['Condition'].count().unstack('AgeBins').fillna(0).plot(
    kind='bar', stacked=True)

Finally, we create a stacked bar chart with the conditions on the x-axis, the frequency on the y-axis, and the grouping colors representative of the age bins. The legend matches the colors to their age bin, where brackets (]) denote an inclusive interval and parentheses (() an exclusive interval.

With minimal effort we’ve described and scripted a repeatable process to generate data, transfer it into cloud storage, from there into a FHIR store, then into a scaleable database, and programmatically analyze it.

As more data becomes available in FHIR, we are excited to have tools like this to Generate, Load, Observe, Visualize, and Export it.