TABLE II: Different classifiers used in the study

Code      Classifier            Parameters
SVM-L     SVM                   kernel: linear, C: 1.0, shrinking: true
SVM-RBF   SVM                   kernel: rbf, C: 1.0, shrinking: true
LR        Logistic Regression   penalty: l2, solver: liblinear, C: 1.0
XGBoost   XGBoost               max depth: 6, learning rate: 0.3
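The configurations in Table II can be sketched in scikit-learn; this is an illustrative setup under the assumption that the study used scikit-learn-style estimators, and the dictionary name is ours, not the paper's.

```python
# Sketch: instantiating the classifiers of Table II with their
# listed hyperparameters (scikit-learn API assumed).
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

classifiers = {
    "SVM-L": SVC(kernel="linear", C=1.0, shrinking=True),
    "SVM-RBF": SVC(kernel="rbf", C=1.0, shrinking=True),
    "LR": LogisticRegression(penalty="l2", solver="liblinear", C=1.0),
    # XGBoost lives outside scikit-learn; it would be configured as
    # xgboost.XGBClassifier(max_depth=6, learning_rate=0.3).
}
```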
TABLE III: Final train and test performance of classification models

Classifier    Micro F-score       Macro F-score
Code          Train    Test       Train    Test
SVM-L         0.95     0.87       0.28     0.24
SVM-RBF       0.80     0.75       0.19     0.18
LR            0.82     0.78       0.20     0.18
XGBoost       0.99     0.92       0.64     0.31
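The micro and macro F-scores reported in Table III can be computed with scikit-learn's `f1_score`; the labels below are toy values for illustration, not the paper's data.

```python
# Sketch: micro vs. macro F-score on a small multi-class example.
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2, 2, 2]

# Micro: pool TP/FP/FN over all classes, then compute one F-score.
micro = f1_score(y_true, y_pred, average="micro")  # 7/8 = 0.875
# Macro: compute per-class F-scores, then take their unweighted mean;
# this is why rare classes drag the macro score far below the micro one.
macro = f1_score(y_true, y_pred, average="macro")
```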
B. Experiment Series 2
Figure 1 shows the top 50 keywords highlighted using TF-IDF and LDA. The proposed approach performs well in highlighting the important regions; for example, the topic highlighted in red, containing "presence range tumor necrosis", provides useful biomarker information to readers.
[Fig. 1 content]
Top 10 keywords (TF-IDF weight): 1. epithelial (0.377), 2. presence (0.269), 3. thymectomy (0.232), 4. epithelial cells (0.210), 5. cells (0.180), 6. small (0.161), 7. lobulated (0.161), 8. lung parenchyma (0.151), 9. appear (0.150), 10. examination (0.139)
Topic 1: examination, thymectomy, measuring, resection, showing, inflammatory, lymph, spaces, lung, immunohistochemistry, node, modified, complete
Topic 2: samples, epithelial, proliferation, mixed, CD20, CD5, histological, according, classification, masaoka
Topic 3: lobulated, necrotic, small, right, architecture, presence, medulla, range, tumor, necrosis, green, appear, parenchyma, cells, cytokeratin, major
Fig. 1: The top 50 keywords in a report identified using TF-IDF weights. The keywords are color-coded according to the abstract "topics" extracted using LDA; each topic is given a separate color scheme.
C. Conclusions
We proposed a simple yet efficient TF-IDF method to extract and corroborate useful keywords from pathology cancer reports. Encoding a pathology report for cancer and tumor surveillance is a laborious task that is subject to human error and variability in interpretation. One of the most important aspects of encoding a pathology report is extracting the primary diagnosis, which may also prove useful for content-based image retrieval when combined with visual information. We used existing classification models with TF-IDF features to predict the primary diagnosis, achieving a micro F-score of up to 0.92 on the test set with the XGBoost classifier. This level of prediction accuracy supports the adoption of machine learning methods for automated information extraction from pathology reports.
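The workflow summarized above, TF-IDF features feeding a classifier that predicts the primary diagnosis, can be sketched as a single scikit-learn pipeline. The texts and labels below are illustrative placeholders, not the paper's data.

```python
# Sketch: end-to-end primary-diagnosis prediction from report text.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["epithelial tumor of the thymus", "necrotic tissue in lung",
         "thymic epithelial proliferation", "lung parenchyma necrosis"]
labels = ["thymoma", "lung", "thymoma", "lung"]

# TF-IDF featurization chained with one of the Table II classifiers.
model = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(solver="liblinear", C=1.0))
model.fit(texts, labels)
pred = model.predict(["epithelial thymus sample"])[0]
```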