Classification of cancer-related death certificates using machine learning


Author(s): Butt, Luke; Zuccon, Guido; Nguyen, Anthony; Bergheim, Anton; Grayson, Narelle
Date(s)

2013

Abstract

Background: Cancer monitoring and prevention rely on the timely notification of cancer cases. However, abstracting and classifying cancer from the free text of pathology reports and other relevant documents, such as death certificates, is complex and time-consuming.
Aims: This paper investigates approaches for automatically detecting notifiable cancer cases as the cause of death from free-text death certificates supplied to Cancer Registries.
Method: A number of machine learning classifiers were studied. Features were extracted using natural language processing techniques and the Medtex toolkit. The feature sets comprised stemmed words, bi-grams, and concepts from the SNOMED CT medical terminology. The baseline was a keyword spotter using keywords extracted from the long descriptions of ICD-10 cancer-related codes.
Results: Death certificates with notifiable cancer listed as the cause of death can be effectively identified with the methods studied. A Support Vector Machine (SVM) classifier achieved the best performance, with an overall F-measure of 0.9866 when evaluated on a set of 5,000 free-text death certificates using the token stem feature set. The SNOMED CT concept plus token stem feature set reached the lowest variance (0.0032) and false negative rate (0.0297) while achieving an F-measure of 0.9864. The SVM classifier accounts for the first 18 of the top 40 evaluated runs and is the most robust classifier, with a variance of 0.001141, half the variance of the other classifiers.
Conclusion: The choice of features had the greatest influence on classifier performance, although the type of classifier employed also affected performance. In contrast, the feature weighting scheme had a negligible effect. Specifically, stemmed tokens, with or without SNOMED CT concepts, were the most effective features when combined with an SVM classifier.
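
As a rough illustration of the keyword-spotter baseline and the evaluation measures named in the abstract, the following is a minimal sketch and not the authors' implementation. The keyword list, the toy certificates, and the function names (`keyword_spotter`, `bigrams`, `evaluate`) are all hypothetical; the paper derives its keywords from ICD-10 long descriptions and builds its features with Medtex, neither of which is reproduced here. The F-measure is assumed to be the balanced F1 score.

```python
import re

# Hypothetical keyword list for illustration only. The paper extracts its
# keywords from the long descriptions of ICD-10 cancer-related codes.
CANCER_KEYWORDS = {"carcinoma", "neoplasm", "melanoma", "lymphoma", "leukaemia"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_spotter(text, keywords=CANCER_KEYWORDS):
    """Baseline: flag a certificate if any token matches a keyword."""
    return any(token in keywords for token in tokenize(text))


def bigrams(tokens):
    """Adjacent token pairs, one of the feature types named in the abstract."""
    return list(zip(tokens, tokens[1:]))


def evaluate(predictions, labels):
    """Precision, recall, F-measure (assumed F1) and false negative rate."""
    pairs = list(zip(predictions, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    fnr = fn / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "false_negative_rate": fnr,
    }


if __name__ == "__main__":
    # Made-up certificate texts with made-up labels (True = notifiable cancer).
    certificates = [
        ("Metastatic carcinoma of the lung", True),
        ("Acute myocardial infarction", False),
        ("Malignant melanoma with brain metastases", True),
        ("Glioblastoma multiforme", True),
        ("Pneumonia; benign neoplasm of colon", False),
    ]
    predictions = [keyword_spotter(text) for text, _ in certificates]
    labels = [label for _, label in certificates]
    print(evaluate(predictions, labels))
```

The toy data deliberately includes one certificate the keyword list misses and one it flags wrongly, which shows why a fixed keyword list is a weak baseline compared with a classifier trained on stems, bi-grams, or terminology concepts.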

Identifier

http://eprints.qut.edu.au/70154/

Publisher

Australasian Medical Journal Pty. Ltd

Relation

http://www.amj.net.au/index.php?journal=AMJ&page=article&op=view&path%5B%5D=1654

DOI:10.4066/AMJ.2013.1654

Butt, Luke, Zuccon, Guido, Nguyen, Anthony, Bergheim, Anton, & Grayson, Narelle (2013) Classification of cancer-related death certificates using machine learning. Australasian Medical Journal, 6(5), pp. 292-299.

Rights

Copyright 2013 AMJ

Source

School of Information Systems; Science & Engineering Faculty

Keywords

#Death certificates #cancer registry #cancer monitoring and reporting #machine learning #natural language processing #SNOMED CT

Type

Journal Article