README

Requirements:

  • cTAKES 4.0.0.1 installed
  • custom NLM Library installed
  • python 3.6+ installed

Preprocessing Clinical Notes (Non-Radiology Notes) and Radiology Reports

Non-Radiology Notes

The UW non-radiology notes required preprocessing before being input to the cTAKES pipeline. The relationships among the key columns in these data were:

Note Line, Note Index -> Note ID -> Encounter ID -> Patient ID

A single note (Note ID) was split into parts identified by a Note Line and a Note Index, one or more Note IDs corresponded to an encounter, and one or more encounters corresponded to a patient. Read in reverse, this relationship made it possible to establish and use unique identifiers for one or more tasks; for example, each Patient ID was unique for a set of encounters. The exception is that the Note Index and Note Line identifiers within a Note ID were not necessarily unique, which required the preprocessing steps below.
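
As a minimal sketch of this hierarchy (assuming pandas and the illustrative file name and column names from the synthetic example below; the real headers may differ), the relationships can be sanity-checked as follows:

    import pandas as pd

    notes = pd.read_csv("non_radiology_notes.csv")  # illustrative file name

    # Each Encounter ID should belong to exactly one Patient ID ...
    assert (notes.groupby("Encounter ID")["Patient ID"].nunique() == 1).all()

    # ... and each Note ID should belong to exactly one Encounter ID.
    assert (notes.groupby("Note ID")["Encounter ID"].nunique() == 1).all()

    # Note Index / Note Line pairs are NOT guaranteed to be unique within
    # a Note ID, which is what the preprocessing below has to resolve.
    dupes = notes.duplicated(subset=["Note ID", "Note Index", "Note Line"], keep=False)
    print(notes[dupes])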

The note indices were out of order relative to the correct reading order of a complete patient note, and frequently included note index entries with duplicated text. The synthetic example in the table below illustrates the typical ordering of notes:

Note Index | Note ID | Note Line | Encounter ID | Patient ID | Note Text
2 | 11112 | 1 | E0001 | P1234 | of mental agitation, and initially refused examination. Physical examination
3 | 11112 | 2 | E0001 | P1234 | The patient arrived and showed signs
1 | 11112 | 3 | E0001 | P1234 | showed swelling and bruising on the third metacarpal of the right hand, and tenderness in the left ventral thorax.
4 | 11112 | 2 | E0001 | P1234 | ADDENDED NOTE: The patient arrived and showed signs
5 | 11112 | 3 | E0001 | P1234 | swelling and bruising on the third metacarpal of the right hand, and tenderness in the left ventral thorax.
6 | 11112 | 2 | E0001 | P1234 | of mental agitation, and initially refused examination. Attending Physician Robert Smith advised drug test before proceeding with patient. Physical examination showed

In the above example, the correct order would be Note Index 3, Note Index 2, Note Index 1, followed by the modified (addended) note in the order Note Index 4, Note Index 6, Note Index 5.

The following pattern was identified:

  • If there were fewer than 3 note index entries, then the order was usually index 2, then index 1
  • If there were more than 3 note index entries, then the order usually started at the second-to-last index (last index - 1), followed by the remaining indices from 1 through the end index
  • Addended notes were usually the most up-to-date and "correct" version of the note, and followed a similar pattern of starting at the second-to-last index (last index - 1)
  • If a note contained a modified or addended note, the note had duplicated note lines.
  • Duplicated note line numbering across Note IDs shared no consistent patterns.

The ordering code produces the correct, ordered, and complete medical note for each Note ID. This code has comments at the beginning of each block that describe the steps and the reasoning behind the deduplication and ordering. In instances where the algorithm could not definitively determine the order of a note and/or which note indices were duplicated entries, it made no modifications to that note. This process was subsequently validated by visual inspection of a sample of notes.
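
The heuristic amounts to something like the following sketch (a simplification based on the patterns listed above; the actual code in the code/ directory covers more cases and leaves ambiguous notes untouched):

    def reorder_note_indices(indices):
        # Heuristic sketch only. `indices` is the ascending list of Note
        # Index values for one Note ID (or one addended block within it).
        if len(indices) < 3:
            # Two entries: index 2 is usually read before index 1.
            return list(reversed(indices))
        if len(indices) > 3:
            # Longer notes: usually start at the second-to-last index,
            # then continue with the remaining indices in ascending order.
            second_to_last = indices[-2]
            return [second_to_last] + [i for i in indices if i != second_to_last]
        # Exactly three entries are not covered by the listed patterns;
        # leave the order unchanged, as the real code does when uncertain.
        return indices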

Additional code was required to remove the remaining duplicated note entries. That algorithm scanned for duplicated note indices within a note using anchor text at the beginning and end of each Note Index; if a pair of Note Indices was determined to be duplicates, one Note Index of the pair was randomly selected and discarded.
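
A minimal sketch of this anchor-text step, assuming the parts of a note are held in a dict keyed by Note Index (the function name and anchor length are illustrative):

    import random

    def drop_duplicate_indices(parts, anchor_len=30):
        # `parts` maps Note Index -> note text for a single Note ID.
        # Two indices whose text starts and ends with the same anchor
        # strings are treated as duplicates; one of the pair is discarded
        # at random, as in the original algorithm.
        seen = {}   # (head anchor, tail anchor) -> Note Index currently kept
        kept = {}   # Note Index -> text for the entries that survive
        for idx, text in parts.items():
            key = (text[:anchor_len], text[-anchor_len:])
            if key in seen and random.random() < 0.5:
                # Replace the previously kept duplicate with this one.
                kept.pop(seen[key], None)
                seen[key] = idx
                kept[idx] = text
            elif key not in seen:
                seen[key] = idx
                kept[idx] = text
            # else: a duplicate was found and the earlier entry is kept.
        return kept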

Radiology Notes

The radiology notes had two separate issues: they had no Patient IDs, and they had missing timestamps.

The only identifier for the radiology notes was the Encounter ID column, so artificial identifiers were created as derivatives of the Encounter ID. An example is shown below:

Encounter ID | Order Time | Perform Time | Result Time | Note Text | Generated Unique Note ID
E0001 | 04-28-2022 10:59:02 | | | XRay film of right wrist | E0001.1
E0001 | 04-28-2022 11:59:02 | | 04-28-2022 12:37:41 | XRay film of chest | E0001.2
E0002 | 04-26-2022 09:39:05 | | | CT Scan of abdomen | E0002.1

cTAKES processing requires a unique identifier for each note, so the Generated Unique Note IDs suffice. For patient-level analyses, it will be necessary to map each radiology Encounter ID to an Encounter ID:Patient ID pair in the non-radiology notes dataset; however, this assumes that every radiology Encounter ID has a corresponding Encounter ID (and hence Patient ID) present within the non-radiology notes.
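
A minimal sketch of how the Generated Unique Note ID can be derived, assuming pandas and the illustrative file and column names from the table above:

    import pandas as pd

    rad = pd.read_csv("radiology_notes.csv")  # illustrative file name

    # Append a 1-based counter within each encounter, e.g. E0001.1, E0001.2.
    rad["Generated Unique Note ID"] = (
        rad["Encounter ID"]
        + "."
        + (rad.groupby("Encounter ID").cumcount() + 1).astype(str)
    )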

Timestamps

As shown in the table above, the Order Time field was the only field that contained a timestamp for every note. The statistics for missing timestamps in the Perform Time and Result Time fields have been uploaded to this repository. Resolving these missing values requires inspection of the source data in the data warehouse/Clarity.

Additionally, the non-radiology notes had only one timestamp field, which also contained missing values. The statistics for these missing timestamps have also been uploaded to this repository.
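
For reference, the missing-timestamp counts can be tallied with a sketch like the one below (file and column names are illustrative; the stats files uploaded to this repository are the authoritative numbers):

    import pandas as pd

    rad = pd.read_csv("radiology_notes.csv")
    time_cols = ["Order Time", "Perform Time", "Result Time"]
    missing = rad[time_cols].isna().sum()
    print(missing)                    # number of missing timestamps per field
    print(100 * missing / len(rad))   # percentage missing per field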

cTAKES NLP Pipeline Overview

Following the preprocessing steps listed above, the medical notes were ready for the cTAKES processing and analysis steps described below.

The cTAKES workflow takes a (large) CSV file as input, splits it into individual files, and then sorts the files into bins by size for processing. The size bins are:

  • < 10 kb ("small")
  • >= 10 kb and < 20 kb ("medium")
  • >= 20 kb and < 40 kb ("large")
  • >= 40 kb ("super-large")

The size cutoffs for each bin were determined from previous work by Majid and from recent analyses showing that cTAKES processing time grows exponentially with file size. Generally speaking, if you have limited CPU resources, the "super-large" and "large" bins of files should be completed before the "small" and "medium" bins.
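
A minimal sketch of the binning rule (the function name is illustrative):

    import os

    def size_bin(path):
        # Assign a file to one of the four size bins described above.
        kb = os.path.getsize(path) / 1024
        if kb < 10:
            return "small"
        elif kb < 20:
            return "medium"
        elif kb < 40:
            return "large"
        return "super-large"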

Each bin of files is further split into subdirectories containing no more than 25,000 files each, which allows streamlined parallel processing. For example, if there were 100,000 files in a dataset, where 10,000 were medium size, 10,000 were large size, 5,000 were super-large size, and 75,000 were small size, the resulting split directory structure would look like this:

small_size/
    small_size_1
    small_size_2
    small_size_3
medium_size/
    medium_size_1
large_size/
    large_size_1
superlarge_size/
    superlarge_size_1

The Python script will create this directory structure in the work directory, and the parallel bash script will look for it when starting cTAKES instances. After the files are split, the parallel bash script starts cTAKES processing on them.
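
A minimal sketch of the split step for a single size bin (the function name, and the assumption that the bin directory already contains the individual note files, are illustrative):

    import shutil
    from pathlib import Path

    def split_size_bin(bin_dir, max_files=25000):
        # Move the files of one size bin into numbered subdirectories
        # (e.g. small_size_1, small_size_2, ...) so that no subdirectory
        # holds more than `max_files` files.
        bin_dir = Path(bin_dir)
        files = sorted(p for p in bin_dir.iterdir() if p.is_file())
        for i, f in enumerate(files):
            subdir = bin_dir / f"{bin_dir.name}_{i // max_files + 1}"
            subdir.mkdir(exist_ok=True)
            shutil.move(str(f), str(subdir / f.name))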

After cTAKES processing completes, the output files will be in compressed format (*.xmi.gz) in the corresponding output directories of the cTAKES instances. They need to be copied to the flatfile generator folder, which will then convert the XMI output to ||-delimited text flatfiles. The output flatfiles can be over 50 GB in size, so it may be advisable to compress them.
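
Collecting the compressed output can be done with a sketch like the following (the directory names are illustrative):

    import shutil
    from pathlib import Path

    # Gather the compressed cTAKES output from every instance's output
    # directory and copy it to the flatfile generator folder.
    ctakes_root = Path("ctakes_instances")
    flatfile_dir = Path("flatfile_generator")
    flatfile_dir.mkdir(exist_ok=True)

    for xmi in ctakes_root.rglob("*.xmi.gz"):
        shutil.copy2(xmi, flatfile_dir / xmi.name)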

Usage

Note1 from John (03-24-2022): The code in the code/ directory was used to process the radiology notes from start to finish, hence the naming. I'll be updating this over the next few days, but the basic steps and usage will not change much.

First, update the NOTES_FILE= variable with the correct file name and update PSEUDO_PAT_ENC_CSN_ID in pat_id_list = df['PSEUDO_PAT_ENC_CSN_ID'].to_list() to use a unique identifier, then run parse_radiology_notes.py to split the initial CSV file.
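
For reference, the two edits look roughly like this inside parse_radiology_notes.py (the file name value is a placeholder; the pat_id_list line is the one quoted above):

    # In parse_radiology_notes.py -- the value shown here is a placeholder.
    NOTES_FILE = "radiology_notes_export.csv"   # point this at the input CSV

    # ... later in the script, use the column that uniquely identifies
    # each encounter:
    pat_id_list = df['PSEUDO_PAT_ENC_CSN_ID'].to_list()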

Then, run parseEntries.cTAKES.Notes.getFileSizes.v3.py to generate a file list with file sizes, and run parseEntries.cTAKES.Notes.sortBySize.v3.py to sort the files by size.

Next, run setup_cTAKES.sh to create the cTAKES instances, update run_cTAKES.radiology.sh with the correct number of cTAKES instances, and then run run_cTAKES.radiology.sh. Rerun run_cTAKES.radiology.sh as needed to process all of the size-split files.

When completed, run setup_for_flatfile_generation.py to create the directory structure indicated above for flatfile processing and to create the required input files for flatfile processing. Finally, run generate_flatfiles.v2_2d.py to generate flatfiles from the compressed XMI files.