Automatic Classification of Pathology Reports
using TF-IDF Features
Shivam Kalra, Larry Li, H.R. Tizhoosh
Kimia Lab, University of Waterloo, Canada
Abstract—The pathology report is arguably one of the most important documents in medicine, containing interpretive information about the visual findings from a patient’s biopsy sample. Each
pathology report has a retention period of up to 20 years after
the treatment of a patient. Cancer registries process and encode
high volumes of free-text pathology reports for surveillance of
cancer and tumor diseases all across the world. In spite of the extremely valuable information they hold, pathology reports are not used in any systematic way to facilitate computational pathol-
ogy. Therefore, in this study we investigate automated machine-
learning techniques to identify/predict the primary diagnosis
(based on ICD-O code) from pathology reports. We performed
experiments by extracting the TF-IDF features from the reports
and classifying them using three different methods—SVM, XG-
Boost, and Logistic Regression. We constructed a new dataset
with 1,949 pathology reports arranged into 37 ICD-O categories,
collected from four different primary sites, namely lung, kidney,
thymus, and testis. The reports were manually transcribed into
text format after collecting them as PDF files from the NCI Genomic
Data Commons public dataset. We subsequently pre-processed
the reports by removing irrelevant textual artefacts produced by
OCR software. The highest classification accuracy we achieved
was 92%, using the XGBoost classifier on TF-IDF feature vectors; the linear SVM scored 87% accuracy. Furthermore, the study shows that TF-IDF vectors are suitable for highlighting the important keywords within a report, which can be helpful for cancer research and the diagnostic workflow. The results are encouraging
in demonstrating the potential of machine learning methods for
classification and encoding of pathology reports.
I. INTRODUCTION
Cancer is one of the leading causes of death in the world,
with over 80,000 deaths registered in Canada in 2017 (Cana-
dian Cancer Statistics 2017). A computer-aided system for
cancer diagnosis usually involves a pathologist rendering a de-
scriptive report after examining the tissue glass slides obtained
from the biopsy of a patient. A pathology report contains
specific analysis of cells and tissues, and other histopatholog-
ical indicators that are crucial for diagnosing malignancies.
An average-sized laboratory may produce a large quantity
of pathology reports annually (e.g., in excess of 50,000), but
these reports are written in mostly unstructured text and with
no direct link to the tissue sample. Furthermore, the report for
each patient is a personalized document that exhibits very high variability in terminology due to a lack of standards; it may even include misspellings and missing punctuation, clinical diagnoses interspersed with complex explanations, different terminology for the same malignancy, and information about multiple carcinoma appearances included in a single
report [1].
In Canada, each Provincial and Territorial Cancer Registry
(PTCR) is responsible for collecting the data about cancer
diseases and reporting them to Statistics Canada (StatCan).
Every year, the Canadian Cancer Registry (CCR) uses the in-
formation sources of StatCan to compile an annual report
on cancer and tumor diseases. Many countries have their
own cancer registry programs. These programs rely on the
acquisition of diagnostic, treatment, and outcome information
through manual processing and interpretation from various un-
structured sources (e.g., pathology reports, autopsy/laboratory
reports, medical billing summaries). The manual classification
of cancer pathology reports is a challenging, time-consuming
task and requires extensive training [1].
With the continued growth in the number of cancer patients,
and the increase in treatment complexity, cancer registries
face a significant challenge in manually reviewing the large
quantity of reports [1], [2]. In this situation, Natural Language
Processing (NLP) systems can offer a unique opportunity
to automatically encode the unstructured reports into struc-
tured data. Since the registries already have access to a
large quantity of historically labeled and encoded reports, a
supervised machine learning approach of feature extraction
and classification is a compelling direction for making their
workflow more effective and streamlined. If successful, such
a solution would enable processing reports in much less time, allowing trained personnel to focus on their research and
analysis. However, developing an automated solution with high
accuracy and consistency across a wide variety of reports is a
challenging problem.
For cancer registries, an important piece of information in a
pathology report is the associated ICD-O code which describes
the patient’s histological diagnosis, as described by the World
Health Organization’s (WHO) International Classification of
Diseases for Oncology [3]. Prediction of the primary diagnosis
from a pathology report provides a valuable starting point
for exploration of machine learning techniques for automated
cancer surveillance. A major application for this purpose
would be “auto-reporting” based on the analysis of whole slide images, i.e., the digitized biopsy glass slides. Structured, summarized, and categorized reports can be associated with the image content when searching in large archives. Such a system would be able to drastically increase the efficiency of diagnostic processes for the majority of cases where, in spite of an obvious primary diagnosis, pathologists still need to spend time and effort writing a descriptive report.
The primary objective of our study is to analyze the efficacy
of existing machine learning approaches for the automated
classification of pathology reports into different diagnosis cat-
egories. We demonstrate that TF-IDF feature vectors combined
with a linear SVM or XGBoost classifier can be an effective method for classification of the reports, achieving up to 92% accuracy. We also show that TF-IDF features are capable
of identifying important keywords within a pathology report.
Furthermore, we have created a new dataset consisting of
1,949 pathology reports across 37 primary diagnoses. Taken
together, our exploratory experiments with a newly introduced dataset of pathology reports open many new opportunities for researchers to develop scalable and automatic information extraction from unstructured pathology reports.
II. BACKGROUND
NLP approaches for information extraction within the
biomedical research areas range from rule-based systems [4],
to domain-specific systems using feature-based classifica-
tion [2], to the recent deep networks for end-to-end feature
extraction and classification [1]. NLP has had varied degrees of success with free-text pathology reports [5]. Various studies have acknowledged the success of NLP in interpreting pathol-
ogy reports, especially for classification tasks or extracting a
single attribute from a report [5], [6].
The Cancer Text Information Extraction System
(caTIES) [7] is a framework, developed in a caBIG project, that focuses on information extraction from pathology reports.
Specifically, caTIES extracts information from surgical
pathology reports (SPR) with good precision as well as recall.
Another system known as Open Registry [8] is capable
of filtering reports whose disease codes indicate cancer.
In [9], an approach called Automated Retrieval Console (ARC)
is proposed which uses machine learning models to predict the
degree of association of a given pathology or radiology report with cancer. The performance ranges from an F-measure of 0.75
for lung cancer to 0.94 for colorectal cancer. However, ARC
uses domain-specific rules, which hinders the generalization of the approach to a variety of pathology reports.
This research work is inspired by themes emerging in
many of the above studies. Specifically, we are evaluating the
task of predicting the primary diagnosis from the pathology
report. Unlike previous approaches, our system does not rely on custom rule-based knowledge, domain-specific features, or a balanced dataset with a small number of classes.
III. MATERIALS AND METHODS
We assembled a dataset of 1,949 cleaned pathology reports.
Each report is associated with one of the 37 different primary
diagnoses based on ICD-O codes. The reports are collected
from four different body parts or primary sites from multiple
patients. The distribution of reports across different primary
diagnoses and primary sites is reported in Table I. The dataset
was developed in three steps as follows.
Collecting pathology reports: A total of 11,112 pathol-
ogy reports were downloaded from NCI’s Genomic Data
Commons (GDC) dataset in PDF format [10]. Out of all PDF
files, 1,949 reports were selected across multiple patients from
four specific primary sites—thymus, testis, lung, and kidney.
TABLE I: Distribution of pathology reports across (a) Primary
diagnosis, used as the label for the study, and (b) Primary site
associated with a report.
(a) Primary Diagnosis
Description Count
Clear cell adenocarcinoma, NOS 523
Squamous cell carcinoma, NOS 340
Papillary adenocarcinoma, NOS 300
Adenocarcinoma, NOS 233
Renal cell carcinoma, chromophobe type 113
Adenocarcinoma with mixed subtypes 89
Seminoma, NOS 68
Thymoma, type AB, malignant 31
Mixed germ cell tumor 30
Thymoma, type B2, malignant 26
Embryonal carcinoma, NOS 26
Thymoma, type A, malignant 15
Renal cell carcinoma, NOS 14
Thymoma, type B1, malignant 13
Bronchiolo-alveolar carcinoma, non-mucinous 13
Thymoma, type B3, malignant 13
Acinar cell carcinoma 13
Mucinous adenocarcinoma 11
Thymic carcinoma, NOS 11
Basaloid squamous cell carcinoma 9
Thymoma, type AB, NOS 7
Squamous cell carcinoma, keratinizing, NOS 7
Teratoma, benign 6
Solid carcinoma, NOS 5
Thymoma, type B2, NOS 5
Yolk sac tumor 4
Papillary squamous cell carcinoma 4
Bronchiolo-alveolar adenocarcinoma, NOS 3
Bronchio-alveolar carcinoma, mucinous 3
Teratoma, malignant, NOS 3
Micropapillary carcinoma, NOS 2
Thymoma, type A, NOS 2
Teratocarcinoma 2
Squamous cell carcinoma, large cell, nonkeratinizing 2
Thymoma, type B1, NOS 1
Squamous cell carcinoma, small cell, nonkeratinizing 1
Signet ring cell carcinoma 1
(b) Primary Site
Kidney 937
Lung 749
Testis 139
Thymus 124
The selection was primarily made based on the quality of PDF
files.
Cleaning reports: The next step was to extract the text
content from these reports. Due to the significant time expense
of manually re-typing all the pathology reports, we developed
a new strategy to prepare our dataset. We applied an Optical
Character Recognition (OCR) software to convert the PDF
reports to text files. Then, we manually inspected all generated
text files to fix any grammar/spelling issues and to remove irrelevant characters introduced as artefacts by the OCR system.
Splitting into training-testing data: We split the cleaned
reports into 70% and 30% for training and testing, respectively.
This split resulted in 1,364 training and 585 testing reports.
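A minimal Python sketch of this split is given below; the one-report-per-file directory layout, the labels.csv metadata file mapping file names to ICD-O diagnoses, and the fixed random seed are assumptions for illustration, not the exact setup used in this study.

```python
# Minimal sketch of the 70/30 split; directory layout, label file, and seed are assumptions.
import csv
import glob
import os
from sklearn.model_selection import train_test_split

# Hypothetical metadata file mapping report file names to their ICD-O primary diagnoses.
label_map = {row["filename"]: row["primary_diagnosis"]
             for row in csv.DictReader(open("labels.csv", encoding="utf-8"))}

paths = sorted(glob.glob("cleaned_reports/*.txt"))          # one cleaned report per text file
reports = [open(p, encoding="utf-8").read() for p in paths]
labels = [label_map[os.path.basename(p)] for p in paths]

train_reports, test_reports, train_labels, test_labels = train_test_split(
    reports, labels, test_size=0.30, random_state=0)
print(len(train_reports), len(test_reports))                # roughly 1,364 / 585 of 1,949 reports
```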
A. Pre-Processing of Reports
We pre-processed the reports by setting their text content
to lowercase and filtering out any non-alphanumeric charac-
ters. We used the NLTK library to remove stop words, e.g., ‘the’, ‘an’, ‘was’, ‘if’, and so on [11]. We then analyzed the
reports to find common bigrams, such as “lung parenchyma”,
“microscopic examination”, “lymph node”, etc. We joined these bigrams with a hyphen, converting each into a single word. We further removed words that occur in less than 2% of the reports within each diagnostic category, as well as words that occur in more than 90% of the reports across all categories. We stored each
pre-processed report in a separate text file.
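The following Python sketch illustrates this pre-processing pipeline (lowercasing, removing non-alphanumeric characters and NLTK stop words, and hyphen-joining frequent bigrams). The tokenization details and the number of bigrams retained are assumptions rather than the exact choices made in this study, and train_reports/test_reports are assumed to hold the cleaned report texts from the split above; the 2%/90% frequency filtering is approximated later via the vectorizer.

```python
# A minimal pre-processing sketch; tokenization and top_k are illustrative assumptions.
import re
import nltk
from nltk.corpus import stopwords
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def clean_tokens(text):
    """Lowercase, keep only alphanumeric tokens, and drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t for t in text.split() if t not in STOP_WORDS]

def frequent_bigrams(token_lists, top_k=100):
    """Most frequent bigrams (e.g., 'lymph node') across a tokenized corpus."""
    finder = BigramCollocationFinder.from_documents(token_lists)
    return set(finder.nbest(BigramAssocMeasures().raw_freq, top_k))

def join_bigrams(tokens, bigrams):
    """Join occurrences of known bigrams into single hyphenated words."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in bigrams:
            out.append(tokens[i] + "-" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

train_tokens = [clean_tokens(r) for r in train_reports]
bigrams = frequent_bigrams(train_tokens)                           # learned from training reports only
train_texts = [" ".join(join_bigrams(t, bigrams)) for t in train_tokens]
test_texts = [" ".join(join_bigrams(clean_tokens(r), bigrams)) for r in test_reports]
```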
B. TF-IDF features
TF-IDF stands for Term Frequency-Inverse Document Fre-
quency, and it is a useful weighting scheme in information
retrieval and text mining. TF-IDF signifies the importance of a
term in a document within a corpus. It is important to note that
a document here refers to a pathology report, a corpus refers
to the collection of reports, and a term refers to a single word
in a report. The TF-IDF weight for a term $t$ in a document $d$ is given by
\begin{equation}
\begin{aligned}
TF(t, d) &= \frac{\text{Number of times } t \text{ appears in } d}{\text{Total number of terms in } d}, \\
IDF(t) &= \log \frac{\text{Total number of documents}}{\text{Number of documents containing } t}, \\
\text{TF-IDF}(t, d) &= TF(t, d) \cdot IDF(t).
\end{aligned}
\end{equation}
We performed the following steps to transform a pathology
report into a feature vector:
1) Create a vocabulary containing all unique words from all the pre-processed training reports.
2) Create a zero vector $f_r$ of the same length as the vocabulary.
3) For each word $t$ in a report $r$, set the corresponding index in $f_r$ to $\text{TF-IDF}(t, r)$.
4) The resulting $f_r$ is the feature vector for report $r$; it is a highly sparse vector.
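A compact way to realize steps 1-4 is scikit-learn's TfidfVectorizer, sketched below using the pre-processed texts from the previous snippet. Note that its IDF is smoothed by default and therefore differs slightly from Eq. (1), and the min_df/max_df values only approximate the 2%/90% frequency filtering described earlier; both are assumptions rather than the exact setup of this study.

```python
# A sketch of steps 1-4 with scikit-learn; min_df/max_df approximate the 2%/90% filtering.
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase=True,
    token_pattern=r"[a-z0-9\-]+",   # keep hyphen-joined bigrams as single terms
    min_df=0.02,                    # drop very rare terms
    max_df=0.90,                    # drop terms present in almost every report
)

# Step 1: the vocabulary is built from the training reports only.
# Steps 2-4: each report becomes a sparse TF-IDF vector f_r over that vocabulary.
X_train = vectorizer.fit_transform(train_texts)   # scipy sparse matrix, one row per report
X_test = vectorizer.transform(test_texts)
```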
C. Keyword extraction and topic modelling
Keyword extraction involves identifying important words within a report that summarize its content. In contrast, topic modelling groups these keywords using an intelligent scheme, enabling users to further focus on certain aspects of a document. All the words in a pathology report are sorted according to their TF-IDF weights. The top $n$ sorted words constitute the top $n$ keywords for the report; $n$ is empirically set to 50 in this research. The extracted
keywords are further grouped into different topics by using
latent Dirichlet allocation (LDA) [12]. The keywords in a
report are highlighted using a color theme based on their
topics.
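The keyword extraction and topic grouping can be sketched as follows, building on the fitted vectorizer above. The number of topics and the practice of fitting LDA directly on TF-IDF features (rather than on raw term counts, as is more common) are illustrative assumptions, not the exact procedure used in this study.

```python
# Keyword extraction and topic grouping sketch; LDA settings are illustrative assumptions.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def top_keywords(tfidf_row, feature_names, n=50):
    """Return the n terms with the largest TF-IDF weights in a single report."""
    weights = tfidf_row.toarray().ravel()
    order = np.argsort(weights)[::-1][:n]
    return [(feature_names[i], weights[i]) for i in order if weights[i] > 0]

feature_names = vectorizer.get_feature_names_out()
keywords = top_keywords(X_test[0], feature_names, n=50)    # top 50 keywords of one report

lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X_train)
term_topic = lda.components_.argmax(axis=0)                # most likely topic per vocabulary term
keyword_topics = {term: int(term_topic[vectorizer.vocabulary_[term]]) for term, _ in keywords}
```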
D. Evaluation metrics
Each model is evaluated using two standard NLP metrics—
micro and macro averaged F-scores, the harmonic mean of
related metrics precision and recall. For each diagnostic category $C_j$ from the set of 37 different classes $C$, with the number of true positives $TP_j$, false positives $FP_j$, and false negatives $FN_j$, the micro F-score is defined as
\begin{equation}
\begin{aligned}
P_{micro} &= \frac{\sum_{C_j \in C} TP_j}{\sum_{C_j \in C} (TP_j + FP_j)}, \\
R_{micro} &= \frac{\sum_{C_j \in C} TP_j}{\sum_{C_j \in C} (TP_j + FN_j)}, \\
F_{micro} &= \frac{2\, P_{micro}\, R_{micro}}{P_{micro} + R_{micro}},
\end{aligned}
\end{equation}
whereas the macro F-score is given by
\begin{equation}
F_{macro} = \frac{1}{|C|} \sum_{C_j \in C} F(C_j).
\end{equation}
In summary, micro-averaged metrics have class representa-
tion roughly proportional to their test set representation (same
as accuracy for a classification problem with a single label per
data point), whereas macro-averaged metrics are averaged by
class without weighting by class prevalence [13].
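Both metrics are available off the shelf; a minimal sketch using scikit-learn, whose micro and macro averaged F1 scores correspond directly to Eqs. (2) and (3), is given below.

```python
# Micro and macro averaged F-scores with scikit-learn.
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),  # Eq. (2)
        "macro_f1": f1_score(y_true, y_pred, average="macro"),  # Eq. (3)
    }
```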
E. Experimental setting
In this study, we performed two different series of experi-
ments: i) evaluating the performance of TF-IDF features and
various machine learning classifiers on the task of predicting
primary diagnosis from the text content of a given report,
and ii) using TF-IDF and LDA techniques to highlight the
important keywords within a report. For the first experiment
series, training reports are pre-processed, then their TF-IDF
features are extracted. The TF-IDF features and the training
labels are used to train different classification models. These
different classification models and their hyper-parameters are
reported in Table II. The performance of classifiers is measured
quantitatively on the test dataset using the evaluation metrics
discussed in the previous section. For the second experiment
series, a random report is selected and its top 50 keywords
are extracted using TF-IDF weights. These 50 keywords are highlighted using different colors based on their associated topics, which are extracted through LDA. A qualitative, non-expert inspection is performed on the extracted keywords and their corresponding topics.
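A minimal sketch of the first experiment series is given below, instantiating the classifiers of Table II with scikit-learn and XGBoost. Any hyper-parameter not listed in Table II is left at its library default, which is an assumption; the feature matrices, label lists, and the evaluate helper come from the earlier snippets.

```python
# Classifiers of Table II; unlisted hyper-parameters keep library defaults (assumption).
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from xgboost import XGBClassifier

encoder = LabelEncoder()                      # XGBoost expects integer class labels
y_train = encoder.fit_transform(train_labels)
y_test = encoder.transform(test_labels)

models = {
    "SVM-L": SVC(kernel="linear", C=1.0, shrinking=True),
    "SVM-RBF": SVC(kernel="rbf", C=1.0, shrinking=True),
    "LR": LogisticRegression(penalty="l2", solver="liblinear", C=1.0),
    "XGBoost": XGBClassifier(max_depth=6, learning_rate=0.3),
}

for name, model in models.items():
    model.fit(X_train, y_train)               # TF-IDF features and diagnosis labels
    print(name, evaluate(y_test, model.predict(X_test)))
```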
IV. RESULTS AND DISCUSSION
A. Experiment Series 1
A classification model is trained to predict the primary
diagnosis given the content of the cancer pathology report.
The performance results on this task are reported in Table III.
We can observe that the XGBoost classifier outperformed all
other models for both the micro F-score metric, with a score
of 0.92, and the macro F-score metric, with a score of 0.31.
This was an improvement of 5% for the micro F-score over the next best model, SVM-L, and of 7% for the macro F-score. It is interesting to note that SVM with a linear kernel performs much better than SVM with an RBF kernel, scoring 9% more on the macro F-score and 12% more on the micro F-score. We suspect this is because words from the primary diagnosis itself occur in some reports, enabling the linear models to outperform more complex models.
TABLE II: Different classifiers used in the study
Code | Classifier | Parameters
SVM-L | SVM | kernel: linear, C: 1.0, shrinking: true
SVM-RBF | SVM | kernel: rbf, C: 1.0, shrinking: true
LR | Logistic Regression | penalty: l2, solver: liblinear, C: 1.0
XGBoost | XGBoost | max depth: 6, learning rate: 0.3
TABLE III: Final train and test performance of classification
models
Classifier Code | Micro F-score (Train / Test) | Macro F-score (Train / Test)
SVM-L | 0.95 / 0.87 | 0.28 / 0.24
SVM-RBF | 0.80 / 0.75 | 0.19 / 0.18
LR | 0.82 / 0.78 | 0.20 / 0.18
XGBoost | 0.99 / 0.92 | 0.64 / 0.31
B. Experiment Series 2
Figure 1 shows the top 50 keywords highlighted using TF-
IDF and LDA. The proposed approach performs well in highlighting the important regions; for example, the topic highlighted in red, containing “presence range tumor necrosis”, provides useful biomarker information to readers.
Top 10 keywords: 1. Epithelial (0.377), 2. Presence (0.269), 3. Thymectomy (0.232), 4. Epithelial cells (0.210), 5. Cells (0.180), 6. Small (0.161), 7. Lobulated (0.161), 8. Lung parenchyma (0.151), 9. Appear (0.150), 10. Examination (0.139).

Topic 1: examination, thymectomy, measuring, resection, showing, inflammatory, lymph, spaces, lung, immunohistochemistry, node, modified, complete
Topic 2: samples, epithelial, proliferation, mixed, CD20, CD5, histological, according, classification, masaoka
Topic 3: lobulated, necrotic, small, right, architecture, presence, medulla, range, tumor, necrosis, green, appear, parenchyma, cells, cytokeratin, right, major

Fig. 1: The top 50 keywords in a report identified using TF-IDF weights. The keywords are color-coded according to the abstract “topics” extracted using LDA; each topic is given a separate color scheme.
C. Conclusions
We proposed a simple yet efficient TF-IDF method to extract and corroborate useful keywords from cancer pathology reports. Encoding a pathology report for cancer and tumor surveillance is a laborious task, and it is sometimes subject to human error and variability in interpretation. One of the most important aspects of encoding a pathology report involves extracting the primary diagnosis. This may be very useful for content-based image retrieval to combine with visual information. We used existing classification models and TF-IDF features to predict the primary diagnosis. We achieved up to 92% accuracy using the XGBoost classifier. This prediction accuracy encourages the adoption of machine learning methods for automated information extraction from pathology reports.
REFERENCES
[1] S. Gao, M. T. Young, J. X. Qiu, H.-J. Yoon, J. B. Christian, P. A. Fearn, G. D. Tourassi, and A. Ramanathan, “Hierarchical attention networks for information extraction from cancer pathology reports.”
[2] R. Weegar, J. F. Nygård, and H. Dalianis, “Efficient encoding of pathology reports using natural language processing,” in RANLP, pp. 778–783.
[3] D. N. Louis, H. Ohgaki, O. D. Wiestler, W. K. Cavenee, P. C. Burger, A. Jouvet, B. W. Scheithauer, and P. Kleihues, “The 2007 WHO classification of tumours of the central nervous system,” Acta Neuropathologica, vol. 114, no. 2, pp. 97–109, 2007.
[4] N. Kang, B. Singh, Z. Afzal, E. M. van Mulligen, and J. A. Kors, “Using rule-based natural language processing to improve disease normalization in biomedical text,” Journal of the American Medical Informatics Association, vol. 20, no. 5, pp. 876–881, 2012.
[5] A. E. Wieneke, E. J. Bowles, D. Cronkite, K. J. Wernli, H. Gao, D. Carrell, and D. S. Buist, “Validation of natural language processing to extract breast cancer pathology procedures and results,” Journal of Pathology Informatics, vol. 6, 2015.
[6] T. D. Imler, J. Morea, C. Kahi, and T. F. Imperiale, “Natural language processing accurately categorizes findings from colonoscopy and pathology reports,” Clinical Gastroenterology and Hepatology, vol. 11, no. 6, pp. 689–694, 2013.
[7] R. S. Crowley, M. Castine, K. Mitchell, G. Chavan, T. McSherry, and M. Feldman, “caTIES: a grid based system for coding and retrieval of surgical pathology reports and tissue specimens in support of translational research,” Journal of the American Medical Informatics Association, vol. 17, no. 3, pp. 253–264, 2010.
[8] P. Contiero, A. Tittarelli, A. Maghini, S. Fabiano, E. Frassoldi, E. Costa, D. Gada, T. Codazzi, P. Crosignani, R. Tessandori, et al., “Comparison with manual registration reveals satisfactory completeness and efficiency of a computerized cancer registration system,” Journal of Biomedical Informatics, vol. 41, no. 1, pp. 24–32, 2008.
[9] L. W. D’Avolio, T. M. Nguyen, W. R. Farwell, Y. Chen, F. Fitzmeyer, O. M. Harris, and L. D. Fiore, “Evaluation of a generalizable approach to clinical information retrieval using the automated retrieval console (ARC),” Journal of the American Medical Informatics Association, vol. 17, no. 4, pp. 375–382, 2010.
[10] R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt, “Toward a shared vision for cancer genomic data,” New England Journal of Medicine, vol. 375, no. 12, pp. 1109–1112, 2016.
[11] E. Loper and S. Bird, “NLTK: The Natural Language Toolkit,” in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP ’02, pp. 63–70, Association for Computational Linguistics.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993–1022, 2003.
[13] J. X. Qiu, H. Yoon, P. A. Fearn, and G. D. Tourassi, “Deep learning for automated extraction of primary sites from cancer pathology reports,” vol. 22, no. 1, pp. 244–251.