Generation of electronic case report forms from pathology reports by ARGO, automatic record generator for onco-hematology
A total of 332 paper histopathology reports were collected between 2014 and 2020 at the pathology unit of the IRCCS Istituto Tumori “Giovanni Paolo II” in Bari, Italy (239) and from six other Italian centers (93): the Hematology Unit, Azienda Ospedaliero-Universitaria Policlinico Umberto I in Rome; Hematology, AUSL/IRCCS of Reggio Emilia in Reggio Emilia; the Division of Hematology 1, AOU “Città della Salute e della Scienza di Torino” in Turin; the Division of Hematology, Azienda Ospedaliero-Universitaria Maggiore della Carità di Novara in Novara; the Department of Medicine, Section of Hematology, University of Verona in Verona; and the Division of Diagnostic Hematopathology, IRCCS European Institute of Oncology in Milan. The internal series included 106 DLBCL, 79 FL and 54 MCL, while the external series included 49 DLBCL, 24 FL and 20 MCL.
A unique identification code was assigned to each report. Depending on the diagnostic criteria for each lymphoma subtype, reports included IHC results obtained from LN, EN, BM, or PB samples. Qualitative and quantitative information was reported for IHC markers including MYC, BCL2, BCL6, CD10, CD20, and Cyclin-D1. Some reports also included molecular data from FISH analysis, while others reported FISH results or the level of tumor cell infiltration as an addendum. For DLBCL, the molecular classification according to COO, estimated by Hans’s algorithm24, was also included. The Ki-67 proliferation index was reported as a quantitative value ranging from 5% to 100%.
The work was approved by the Institutional Review Board of the IRCCS Istituto Tumori “Giovanni Paolo II” Hospital in Bari, Italy. All methods were performed in accordance with applicable local regulations and after obtaining dedicated informed consent.
Automated detection of relevant terms in paper reports
At this stage of the workflow, we aimed to automate the detection of the relevant terms to be extracted from the text fields of paper reports. ARGO uses OCR25 and NLP26 techniques to convert report images into text and to detect relevant words in the text based on an ad hoc thesaurus.
Image-to-text conversion was performed with Tesseract OCR© (version 4.1.1-rc2-20-g01fb). To improve conversion performance, each pathology report was first converted from PDF to image via the Poppler library (version 0.26.5). The image was then converted to 8-bit gray scale (gray levels 0 to 255).
Image transformation was implemented in Python with OpenCV© (version 4.2.0).
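The conversion steps described above could be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes the `pdf2image` wrapper around Poppler and the `pytesseract` wrapper around Tesseract, neither of which is named in the text, and the function name `ocr_report`, the 300 dpi setting, and the Italian language model are hypothetical choices.

```python
def ocr_report(pdf_path, dpi=300, lang="ita"):
    """Return the OCR text of every page of a PDF report (illustrative sketch)."""
    # Third-party imports are kept local so the module loads without them.
    import cv2                                 # OpenCV, as named in the text
    import numpy as np
    import pytesseract                         # assumed wrapper around Tesseract OCR
    from pdf2image import convert_from_path    # assumed wrapper around Poppler

    pages = convert_from_path(pdf_path, dpi=dpi)        # PDF -> list of PIL images
    texts = []
    for page in pages:
        rgb = np.array(page.convert("RGB"))
        gray = cv2.cvtColor(rgb, cv2.COLOR_RGB2GRAY)    # 8-bit gray scale, 0 to 255
        texts.append(pytesseract.image_to_string(gray, lang=lang))
    return "\n".join(texts)
```

Converting to gray scale before OCR removes color information that the recognizer does not need; whether further steps such as thresholding were applied is not stated in the text.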
In ARGO, NLP techniques were adopted to automatically extract the terms relevant to the diagnosis of the disease and transfer them into the eCRFs. A set of NLP regular expressions was applied to extract information on diagnosis, report date, report ID, sample type, and the performance of BM biopsy, IHC, and FISH, as well as quantitative and qualitative data on IHC markers (MYC, BCL2, BCL6, CD10, CD20, Cyclin-D1), COO subtypes, and the Ki-67 proliferation index (see the section “ARGO functions and NLP rules”).
Disease nomenclature was assigned based on the highest match between the pattern of biomarkers detected in each report and a reference pattern, as outlined in the “Hematopoietic and Lymphoid Neoplasms Coding Manual Guidelines” of the “Surveillance, Epidemiology, and End Results (SEER) Program” of the National Institutes of Health27. The nomenclature of the final diagnosis referred to the ICD-10 classification23. Communication between ARGO and the official SEER servers was handled through an API.
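The idea of assigning the diagnosis with the highest pattern match can be sketched as below. The reference patterns, the scoring rule (fraction of agreeing markers), and the names `REFERENCE_PATTERNS`, `match_score`, and `best_diagnosis` are all hypothetical; they are simplified placeholders and are not taken from the SEER manual or from ARGO.

```python
# Hypothetical reference patterns for illustration only (not from the SEER manual).
REFERENCE_PATTERNS = {
    "MCL": {"CD20": "+", "Cyclin-D1": "+", "CD10": "-"},
    "FL": {"CD20": "+", "CD10": "+", "BCL2": "+", "BCL6": "+", "Cyclin-D1": "-"},
    "DLBCL": {"CD20": "+", "Cyclin-D1": "-"},
}


def match_score(observed, reference):
    """Fraction of reference markers whose status agrees with the report."""
    hits = sum(1 for marker, status in reference.items()
               if observed.get(marker) == status)
    return hits / len(reference)


def best_diagnosis(observed, references=REFERENCE_PATTERNS):
    """Return the (label, score) pair with the highest match score."""
    scores = {label: match_score(observed, ref) for label, ref in references.items()}
    label = max(scores, key=scores.get)   # ties resolve to the first label listed
    return label, scores[label]


report = {"CD20": "+", "Cyclin-D1": "+", "CD10": "-", "BCL2": "+"}
label, score = best_diagnosis(report)
```

A marker missing from the report simply counts as a non-match here; a production rule set would need an explicit policy for missing and conflicting markers.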
ARGO was developed in Flask© (version 1.1.2); the web server ran Oracle© Linux 7.8 with kernel 4.14.35-1902.303.5.3.el7uek.x86_64. We used MariaDB© 5.5.68 as the database. NLP algorithms were developed in Python 3.6.8. The translation from English to Italian was done via the MyMemory© API tool (version 3.5.0). To increase the detectability of biomarkers in reports, we also constructed three thesauri in Python with NLP regular expressions (source code in Supplementary Appendix S1 and Table S2). Although these thesauri are domain specific, the ability to extract knowledge by flexibly plugging in a new thesaurus is a general feature of ARGO.
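A regular-expression thesaurus of the kind described above might look like the following sketch. The entries and the names `THESAURUS`, `compile_thesaurus`, and `normalize_markers` are hypothetical and far smaller than the three thesauri in the supplementary material; the point is only that each canonical marker name maps to a pattern covering its spelling variants, so a new thesaurus can be swapped in as data.

```python
import re

# Hypothetical thesaurus: canonical marker name -> regex of spelling variants.
THESAURUS = {
    "Cyclin-D1": r"c[iy]clin[ae]?[\s-]*d[\s-]*1",
    "Ki-67": r"ki[\s-]*67",
    "BCL2": r"bcl[\s-]*2",
    "BCL6": r"bcl[\s-]*6",
}


def compile_thesaurus(thesaurus):
    """Compile each variant pattern as a whole term not followed by another digit."""
    return {name: re.compile(r"\b(?:" + pattern + r")(?!\d)", re.IGNORECASE)
            for name, pattern in thesaurus.items()}


def normalize_markers(text, compiled):
    """Return the sorted canonical names of all markers mentioned in text."""
    return sorted(name for name, rx in compiled.items() if rx.search(text))


compiled = compile_thesaurus(THESAURUS)
found = normalize_markers("Ciclina D1 positiva, Ki67 30%, bcl-2 +", compiled)
```

Because the thesaurus is passed in as an argument, supporting another disease area in this sketch would only require supplying a different dictionary.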
ARGO functions and NLP rules
ARGO was developed around three functions: function_read.py, header_info.py, and params.py. The main function, function_read.py, included (1) the call to header_info.py to recognize the template of the input report, (2) the set of NLP expressions to identify both the biomarkers and the diagnostic description, and (3) the call to params.py, which held two API tokens: the first to retrieve biomarker and diagnostic data from the SEER database, and the second, issued for the REDCap project ID, to enable automatic data entry. Supplementary Figure S2A details the pseudocode for processing a pathology report. ARGO integrated two main activities: (i) recognition of the template from the header section, including the fields “BIOPSY DATE” and “ID NUMBER”, patient demographic information (“SURNAME”, “NAME”, “DATE OF BIRTH”, “PLACE OF BIRTH”, “SEX”, and “SSN” [Social Security Number]), and the “SAMPLE TYPE” (via header_info.py); and (ii) recognition of “IHC MARKERS” (“POSITIVITY/NEGATIVITY” or “QUANTITY”) from the biological samples, and of the fields “FISH”, “DIAGNOSIS”, and “CELL OF ORIGIN” from the disease section (via function_read.py). Supplementary Figure S2B shows an example of an NLP entry from the internal series. The regular expressions used to automatically recognize the header section of internal reports are shown in Table 4. Those for external reports are detailed in Supplementary Table S3.
Regarding function_read.py, we identified all the pathological description patterns according to the following four scenarios:
description of qualitative markers by symbolic qualifiers in the form of free text (for example “+” for positivity and “-” for negativity);
description of qualitative markers by textual qualifiers in free text form (e.g., “positive”, “reactive” or “immunoreactive” for positivity and “negative” or “immunonegative” for negativity);
description of qualitative and quantitative markers by symbolic or textual qualifiers in the form of bullets;
description of pure quantitative markers (such as Ki-67).
Table 5 shows three representative description patterns with their NLP pseudocode and expected results. All patterns are detailed in Supplementary Table S4.
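To make the four scenarios concrete, the sketch below handles symbolic qualifiers, textual qualifiers, and a purely quantitative marker with two regular expressions. It is a simplified illustration, not the rules in Table 5 or Supplementary Table S4: the patterns, the English qualifier words, and the names `QUALITATIVE`, `KI67`, and `extract_markers` are assumptions. Bulleted descriptions would be covered by running the same function on each bullet line.

```python
import re

CANONICAL = {m.lower(): m for m in
             ("MYC", "BCL2", "BCL6", "CD10", "CD20", "Cyclin-D1")}
MARKERS = "(" + "|".join(re.escape(m) for m in CANONICAL.values()) + ")"
POSITIVE = {"+", "positive", "reactive", "immunoreactive"}

# Marker followed by a symbolic or textual qualifier,
# e.g. "CD20+", "CD10 -", "BCL2: positive".
QUALITATIVE = re.compile(
    MARKERS + r"\s*[:=-]?\s*"
    r"(\+|-(?!\w)|positive|negative|immunoreactive|immunonegative|reactive)",
    re.IGNORECASE,
)
# Purely quantitative marker, e.g. "Ki-67: 80%".
KI67 = re.compile(r"Ki-?67\D{0,30}?(\d{1,3})\s*%", re.IGNORECASE)


def extract_markers(text):
    """Return {marker: '+' or '-'}, plus 'Ki-67' as an integer percentage if found."""
    found = {}
    for marker, qualifier in QUALITATIVE.findall(text):
        status = "+" if qualifier.lower() in POSITIVE else "-"
        found[CANONICAL[marker.lower()]] = status
    ki67 = KI67.search(text)
    if ki67:
        found["Ki-67"] = int(ki67.group(1))
    return found


sample = "CD20+, CD10 -, BCL2: positive, BCL6 immunonegative. Ki-67: 80%"
result = extract_markers(sample)
```

The negative lookahead after the minus sign keeps a hyphenated form such as "CD10-positive" from being read as negative; real reports would need many more such guards.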
Data mapping and automatic eCRF filling
For a systematic collection of diagnostic variables in this study, we designed dedicated eCRFs on REDCap17,18. The eCRFs were adapted to the synoptic templates provided and approved by the CAP. We referred to the DLBCL, FL, and MCL templates28,29. Data mapping between ARGO and the eCRFs was done by providing the relevant data fields of the REDCap dictionary as flexible input to the application (Supplementary Table S5). Finally, we used API technology for automatic data entry and the final upload of the information of interest into the eCRFs.
ARGO performance, defined as the level of consistency between the data included in the original pathology reports and the data automatically transferred to the eCRFs, was assessed in terms of accuracy, precision, recall, and F1 score30. To calculate each measure, we defined the following cases: (1) true positive: cases in which ARGO correctly detected the expected variables; (2) false positive: cases in which ARGO detected variables that were not present in the original report; (3) true negative: cases in which ARGO did not detect a variable that was not present in the original report; and (4) false negative: cases in which ARGO failed to detect a variable present in the original report.
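The mapping and upload step could be sketched as follows, using only the Python standard library. The request fields (`token`, `content`, `format`, `type`, `overwriteBehavior`, `data`) follow the REDCap "Import Records" API; everything else is an assumption for illustration, including the REDCap field names in `FIELD_MAP`, the function names, and the API URL, which would come from the local REDCap instance and its data dictionary.

```python
import json
from urllib import parse, request

# Hypothetical mapping from extracted variables to REDCap field names.
FIELD_MAP = {"CD20": "cd20_status", "BCL2": "bcl2_status", "Ki-67": "ki67_percent"}


def to_redcap_record(record_id, extracted, field_map=FIELD_MAP):
    """Keep only mapped variables and rename them to REDCap field names."""
    record = {"record_id": record_id}
    for name, value in extracted.items():
        if name in field_map:
            record[field_map[name]] = str(value)
    return record


def build_import_payload(token, record):
    """Build the form fields for a REDCap 'Import Records' API call."""
    return {
        "token": token,
        "content": "record",
        "format": "json",
        "type": "flat",
        "overwriteBehavior": "normal",
        "data": json.dumps([record]),
    }


def upload_record(api_url, token, record):
    """POST one record to REDCap and return the raw response body."""
    body = parse.urlencode(build_import_payload(token, record)).encode("utf-8")
    with request.urlopen(request.Request(api_url, data=body)) as response:
        return response.read().decode("utf-8")


record = to_redcap_record("R001", {"CD20": "+", "Ki-67": 80, "MYC": "-"})
payload = build_import_payload("HYPOTHETICAL_TOKEN", record)
```

Keeping the field map as data rather than code mirrors the idea of supplying the REDCap dictionary fields as input, so a different eCRF would only need a different map in this sketch.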
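The four measures follow directly from the four case counts defined above. A minimal sketch, with the function name `performance` and the example counts chosen purely for illustration (they are not results from the study):

```python
def performance(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from the four case counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts for one data field.
scores = performance(tp=90, fp=5, tn=3, fn=2)
```

Each ratio falls back to 0.0 when its denominator is zero, which is one common convention; the text does not say how such cases were handled.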
The results for each data field of the internal and external series were compared statistically by a chi-square test.
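For a single data field, such a comparison can be framed as a 2×2 table of correct versus incorrect extractions in the two series. The sketch below computes the Pearson chi-square statistic and its p-value for one degree of freedom using only the standard library. It assumes no continuity correction, which the text does not specify, and the function name and counts are hypothetical.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a non-zero total")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: (correct, incorrect) for internal and external reports.
chi2, p_value = chi_square_2x2(230, 9, 85, 8)
```

With small expected counts, an exact test would usually be preferred over this approximation.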