README

Requirements:

  • cTAKES 4.0.0.1 installed
  • custom NLM Library installed
  • python 3.6+ installed

Preprocessing Clinical Notes (Non-Radiology Notes) and Radiology Reports

Non-Radiology Notes

The UW non-radiology notes required preprocessing before being input to the cTAKES pipeline. The relationships among the key columns in these data were:

Note Line, Note Index -> Note ID -> Encounter ID -> Patient ID

A single note (Note ID) was split into parts identified by a Note Line and a Note Index, one or more Note IDs corresponded to an encounter, and one or more encounters corresponded to a patient. Read in reverse, this relationship made it possible to establish and use unique identifiers for one or more tasks; for example, each Patient ID was unique for a set of encounters. The exception is that the Note Index and Note Line identifiers within a Note ID were not necessarily unique, which required the preprocessing steps below.
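
As a minimal sketch of this hierarchy (assuming pandas and the illustrative file name and column names from the synthetic example below; the real headers may differ), the relationships can be sanity-checked as follows:

    import pandas as pd

    notes = pd.read_csv("non_radiology_notes.csv")  # illustrative file name

    # Each Encounter ID should belong to exactly one Patient ID ...
    assert (notes.groupby("Encounter ID")["Patient ID"].nunique() == 1).all()

    # ... and each Note ID should belong to exactly one Encounter ID.
    assert (notes.groupby("Note ID")["Encounter ID"].nunique() == 1).all()

    # Note Index / Note Line pairs are NOT guaranteed to be unique within
    # a Note ID, which is what the preprocessing below has to resolve.
    dupes = notes.duplicated(subset=["Note ID", "Note Index", "Note Line"], keep=False)
    print(notes[dupes])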

The note indices were out of order relative to the correct reading order of a complete patient note, and frequently included note index entries with duplicated text. The synthetic example in the table below illustrates the typical ordering of notes:

Note Index | Note ID | Note Line | Encounter ID | Patient ID | Note Text
2 | 11112 | 1 | E0001 | P1234 | of mental agitation, and initially refused examination. Physical examination
3 | 11112 | 2 | E0001 | P1234 | The patient arrived and showed signs
1 | 11112 | 3 | E0001 | P1234 | showed swelling and bruising on the third metacarpal of the right hand, and tenderness in the left ventral thorax.
4 | 11112 | 2 | E0001 | P1234 | ADDENDED NOTE: The patient arrived and showed signs
5 | 11112 | 3 | E0001 | P1234 | swelling and bruising on the third metacarpal of the right hand, and tenderness in the left ventral thorax.
6 | 11112 | 2 | E0001 | P1234 | of mental agitation, and initially refused examination. Attending Physician Robert Smith advised drug test before proceeding with patient. Physical examination showed

In the above example, the correct order would be Note Index 3, Note Index 2, Note Index 1, followed by the modified (addended) note in the order Note Index 4, Note Index 6, Note Index 5.

The following pattern was identified:

  • If there were fewer than 3 note index entries, then the order was usually index 2, then index 1
  • If there were more than 3 note index entries, then the order usually started at the second-to-last index (last index - 1), followed by the remaining indices from 1 through the end index
  • Addended notes were usually the most up-to-date and "correct" version of the note, and followed a similar pattern of starting at the second-to-last index (last index - 1)
  • If a note contained a modified or addended note, the note had duplicated note lines.
  • Duplicated note line numbering across Note IDs shared no consistent patterns.

The ordering code produces the correct, ordered, and complete medical note for each Note ID. This code has comments at the beginning of each block that describe the steps and the reasoning behind the deduplication and ordering. In instances where the algorithm could not definitively determine the order of a note and/or which note indices were duplicated entries, it made no modifications to that note. This process was subsequently validated by visual inspection of a sample of notes.
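
The heuristic amounts to something like the following sketch (a simplification based on the patterns listed above; the actual code in the code/ directory covers more cases and leaves ambiguous notes untouched):

    def reorder_note_indices(indices):
        # Heuristic sketch only. `indices` is the ascending list of Note
        # Index values for one Note ID (or one addended block within it).
        if len(indices) < 3:
            # Two entries: index 2 is usually read before index 1.
            return list(reversed(indices))
        if len(indices) > 3:
            # Longer notes: usually start at the second-to-last index,
            # then continue with the remaining indices in ascending order.
            second_to_last = indices[-2]
            return [second_to_last] + [i for i in indices if i != second_to_last]
        # Exactly three entries are not covered by the listed patterns;
        # leave the order unchanged, as the real code does when uncertain.
        return indices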

Additional code was required to remove the remaining duplicated note entries. That algorithm scanned for duplicated note indices within a note using anchor text at the beginning and end of each Note Index; if a pair of Note Indices was determined to be duplicates, one Note Index of the pair was randomly selected and discarded.
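
A minimal sketch of this anchor-text step, assuming the parts of a note are held in a dict keyed by Note Index (the function name and anchor length are illustrative):

    import random

    def drop_duplicate_indices(parts, anchor_len=30):
        # `parts` maps Note Index -> note text for a single Note ID.
        # Two indices whose text starts and ends with the same anchor
        # strings are treated as duplicates; one of the pair is discarded
        # at random, as in the original algorithm.
        seen = {}   # (head anchor, tail anchor) -> Note Index currently kept
        kept = {}   # Note Index -> text for the entries that survive
        for idx, text in parts.items():
            key = (text[:anchor_len], text[-anchor_len:])
            if key in seen and random.random() < 0.5:
                # Replace the previously kept duplicate with this one.
                kept.pop(seen[key], None)
                seen[key] = idx
                kept[idx] = text
            elif key not in seen:
                seen[key] = idx
                kept[idx] = text
            # else: a duplicate was found and the earlier entry is kept.
        return kept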

Radiology Notes

The radiology notes had two separate issues: they had no Patient IDs, and they had missing timestamps.

The only identifier for the radiology notes was the Encounter ID column, so artificial identifiers were created as derivatives of the Encounter ID. An example is shown below:

Encounter ID | Order Time | Perform Time | Result Time | Note Text | Generated Unique Note ID
E0001 | 04-28-2022 10:59:02 | | | XRay film of right wrist | E0001.1
E0001 | 04-28-2022 11:59:02 | | 04-28-2022 12:37:41 | XRay film of chest | E0001.2
E0002 | 04-26-2022 09:39:05 | | | CT Scan of abdomen | E0002.1

cTAKES processing requires a unique identifier for each note, so the Generated Unique Note IDs suffice. For patient-level analyses, it will be necessary to map each radiology Encounter ID to an Encounter ID:Patient ID pair in the non-radiology notes dataset; however, this assumes that every radiology Encounter ID has a corresponding Encounter ID (and hence Patient ID) present within the non-radiology notes.
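
A minimal sketch of how the Generated Unique Note ID can be derived, assuming pandas and the illustrative file and column names from the table above:

    import pandas as pd

    rad = pd.read_csv("radiology_notes.csv")  # illustrative file name

    # Append a 1-based counter within each encounter, e.g. E0001.1, E0001.2.
    rad["Generated Unique Note ID"] = (
        rad["Encounter ID"]
        + "."
        + (rad.groupby("Encounter ID").cumcount() + 1).astype(str)
    )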

Timestamps

As shown in the table above, the Order Time field was the only field that contained a timestamp for every note. The statistics for missing timestamps in the Perform Time and Result Time fields have been uploaded to this repository. Resolving these missing values requires inspection of the source data in the data warehouse/Clarity.

Additionally, the non-radiology notes had only one timestamp field, which also contained missing values. The statistics for these missing timestamps have also been uploaded to this repository.
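
For reference, the missing-timestamp counts can be tallied with a sketch like the one below (file and column names are illustrative; the stats files uploaded to this repository are the authoritative numbers):

    import pandas as pd

    rad = pd.read_csv("radiology_notes.csv")
    time_cols = ["Order Time", "Perform Time", "Result Time"]
    missing = rad[time_cols].isna().sum()
    print(missing)                    # number of missing timestamps per field
    print(100 * missing / len(rad))   # percentage missing per field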

cTAKES NLP Pipeline Overview

Following the preprocessing steps listed above, the medical notes were ready for the cTAKES processing and analysis steps described below.

The cTAKES workflow takes a (large) CSV file as input, splits it into individual files, and then sorts the files into bins by size for processing. The size bins are:

  • < 10 kb ("small")
  • >= 10 kb and < 20 kb ("medium")
  • >= 20 kb and < 40 kb ("large")
  • >= 40 kb ("super-large")

The size cutoffs for each bin were determined from previous work by Majid and from recent analyses showing that cTAKES processing time grows exponentially with file size. Generally speaking, if you have limited CPU resources, the "super-large" and "large" bins of files should be completed before the "small" and "medium" bins.
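
A minimal sketch of the binning rule (the function name is illustrative):

    import os

    def size_bin(path):
        # Assign a file to one of the four size bins described above.
        kb = os.path.getsize(path) / 1024
        if kb < 10:
            return "small"
        elif kb < 20:
            return "medium"
        elif kb < 40:
            return "large"
        return "super-large"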

Each bin of files is further split into subdirectories containing no more than 25,000 files each, which allows streamlined parallel processing. For example, if there were 100,000 files in a dataset, where 10,000 were medium size, 10,000 were large size, 5,000 were super-large size, and 75,000 were small size, the resulting split directory structure would look like this:

small_size/
    small_size_1
    small_size_2
    small_size_3
medium_size/
    medium_size_1
large_size/
    large_size_1
superlarge_size/
    superlarge_size_1

The Python script will create this directory structure in the work directory, and the parallel bash script will look for it when starting cTAKES instances. After the files are split, the parallel bash script starts cTAKES processing on them.
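
A minimal sketch of the split step for a single size bin (the function name, and the assumption that the bin directory already contains the individual note files, are illustrative):

    import shutil
    from pathlib import Path

    def split_size_bin(bin_dir, max_files=25000):
        # Move the files of one size bin into numbered subdirectories
        # (e.g. small_size_1, small_size_2, ...) so that no subdirectory
        # holds more than `max_files` files.
        bin_dir = Path(bin_dir)
        files = sorted(p for p in bin_dir.iterdir() if p.is_file())
        for i, f in enumerate(files):
            subdir = bin_dir / f"{bin_dir.name}_{i // max_files + 1}"
            subdir.mkdir(exist_ok=True)
            shutil.move(str(f), str(subdir / f.name))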

After cTAKES processing completes, the output files will be in compressed format (*.xmi.gz) in the corresponding output directories of the cTAKES instances. They need to be copied to the flatfile generator folder, which will then convert the XMI output to ||-delimited text flatfiles. The output flatfiles can be over 50 GB in size, so it may be advisable to compress them.
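
Collecting the compressed output can be done with a sketch like the following (the directory names are illustrative):

    import shutil
    from pathlib import Path

    # Gather the compressed cTAKES output from every instance's output
    # directory and copy it to the flatfile generator folder.
    ctakes_root = Path("ctakes_instances")
    flatfile_dir = Path("flatfile_generator")
    flatfile_dir.mkdir(exist_ok=True)

    for xmi in ctakes_root.rglob("*.xmi.gz"):
        shutil.copy2(xmi, flatfile_dir / xmi.name)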

Usage

Note1 from John (03-24-2022): The code in the code/ directory was used to process the radiology notes from start to finish, hence the naming. I'll be updating this over the next few days, but the basic steps and usage will not change much.

First, update the NOTES_FILE= variable with the correct file name and update PSEUDO_PAT_ENC_CSN_ID in pat_id_list = df['PSEUDO_PAT_ENC_CSN_ID'].to_list() to use a unique identifier, then run parse_radiology_notes.py to split the initial CSV file.
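
For reference, the two edits look roughly like this inside parse_radiology_notes.py (the file name value is a placeholder; the pat_id_list line is the one quoted above):

    # In parse_radiology_notes.py -- the value shown here is a placeholder.
    NOTES_FILE = "radiology_notes_export.csv"   # point this at the input CSV

    # ... later in the script, use the column that uniquely identifies
    # each encounter:
    pat_id_list = df['PSEUDO_PAT_ENC_CSN_ID'].to_list()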

Then, run parseEntries.cTAKES.Notes.getFileSizes.v3.py to generate a file list with file sizes, and run parseEntries.cTAKES.Notes.sortBySize.v3.py to sort the files by size.

Next, run setup_cTAKES.sh to create the cTAKES instances, update run_cTAKES.radiology.sh with the correct number of cTAKES instances, and then run run_cTAKES.radiology.sh. Rerun run_cTAKES.radiology.sh as needed to process all of the size-split files.

When completed, run setup_for_flatfile_generation.py to create the directory structure indicated above for flatfile processing and to create the required input files for flatfile processing. Finally, run generate_flatfiles.v2_2d.py to generate flatfiles from the compressed XMI files.