62 results for Classification (of information)
Abstract:
Background and aims: Machine learning techniques for the text mining of cancer-related clinical documents have not been sufficiently explored. Here some techniques are presented for the pre-processing of free-text breast cancer pathology reports, with the aim of facilitating the extraction of information relevant to cancer staging.
Materials and methods: The first technique was implemented using the freely available software RapidMiner to classify the reports according to their general layout: ‘semi-structured’ and ‘unstructured’. The second technique was developed using the open source language engineering framework GATE and aimed at the prediction of chunks of the report text containing information pertaining to the cancer morphology, the tumour size, its hormone receptor status and the number of positive nodes. The classifiers were trained and tested respectively on sets of 635 and 163 manually classified or annotated reports, from the Northern Ireland Cancer Registry.
Results: The best result of 99.4% accuracy – which included only one semi-structured report predicted as unstructured – was produced by the layout classifier with the k-nearest neighbour algorithm, using the binary term occurrence word vector type with stopword filter and pruning. For chunk recognition, the best results were found using the PAUM algorithm with the same parameters for all cases, except for the prediction of chunks containing cancer morphology. For semi-structured reports the performance ranged from 0.97 to 0.94 and from 0.92 to 0.83 in precision and recall, while for unstructured reports performance ranged from 0.91 to 0.64 and from 0.68 to 0.41 in precision and recall. Poor results were found when the classifier was trained on semi-structured reports but tested on unstructured reports.
Conclusions: These results show that it is possible and beneficial to predict the layout of reports and that the accuracy of prediction of which segments of a report may contain certain information is sensitive to the report layout and the type of information sought.
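The layout classification step described in this abstract can be illustrated with a minimal sketch. It assumes a toy stopword list, Jaccard similarity between binary term sets, and six invented report snippets; none of these come from the study, which used RapidMiner and also applied pruning (omitted here).

```python
from collections import Counter

# Illustrative stopword list only; the study's actual filter is not given.
STOPWORDS = {"the", "of", "a", "and", "is", "in", "to"}

def binary_terms(text):
    """Binary term occurrence: the set of non-stopword tokens present."""
    tokens = (t.strip(".,:;()").lower() for t in text.split())
    return {t for t in tokens if t and t not in STOPWORDS}

def jaccard(a, b):
    """Similarity of two binary term sets (an assumed choice of measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_predict(train, text, k=3):
    """Majority label among the k training reports most similar to text."""
    query = binary_terms(text)
    ranked = sorted(
        train,
        key=lambda item: jaccard(binary_terms(item[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical snippets, invented for illustration.
train = [
    ("Tumour size: 22 mm. Nodes positive: 2. Grade: 3", "semi-structured"),
    ("Tumour size: 15 mm. Grade: 2. Nodes positive: 0", "semi-structured"),
    ("Histological type: ductal. Tumour size: 30 mm", "semi-structured"),
    ("Sections show an invasive ductal carcinoma measuring about 22 mm across", "unstructured"),
    ("The specimen contains a carcinoma which extends close to the margin", "unstructured"),
    ("Microscopy shows a lobular carcinoma with two involved lymph nodes", "unstructured"),
]

print(knn_predict(train, "Grade: 2. Tumour size: 18 mm. Nodes positive: 1"))
print(knn_predict(train, "Sections show a carcinoma which extends to the lymph nodes"))
```

Representing each report as a set of terms, rather than as term counts, mirrors the binary term occurrence vector type named in the results.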
Abstract:
Logistic regression and Gaussian mixture model (GMM) classifiers have been trained to estimate the probability of acute myocardial infarction (AMI) in patients based upon the concentrations of a panel of cardiac markers. The panel consists of two new markers, fatty acid binding protein (FABP) and glycogen phosphorylase BB (GPBB), in addition to the traditional cardiac troponin I (cTnI), creatine kinase MB (CKMB) and myoglobin. The effect of using principal component analysis (PCA) and Fisher discriminant analysis (FDA) to preprocess the marker concentrations was also investigated. The need for classifiers to give an accurate estimate of the probability of AMI is argued and three categories of performance measure are described, namely discriminatory ability, sharpness, and reliability. Numerical performance measures for each category are given and applied. The optimum classifier, based solely upon the samples taken on admission, was the logistic regression classifier using FDA preprocessing. This gave an accuracy of 0.85 (95% confidence interval: 0.78-0.91) and a normalised Brier score of 0.89. When samples at both admission and a further time, 1-6 h later, were included, the performance increased significantly, showing that logistic regression classifiers can indeed use the information from the five cardiac markers to accurately and reliably estimate the probability of AMI. © Springer-Verlag London Limited 2008.
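As a rough illustration of how a probabilistic classifier of this kind might be scored, the sketch below computes a logistic probability, thresholded accuracy, and the standard (unnormalised) Brier score. The weights, marker values, and the 0.5 threshold are hypothetical, and the normalisation behind the reported score of 0.89 is not defined in the abstract, so it is not reproduced here.

```python
import math

def logistic_probability(weights, bias, features):
    """P(AMI | features) under a logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def accuracy(probs, outcomes, threshold=0.5):
    """Fraction of cases where the thresholded prediction matches the outcome."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(probs)

# Hypothetical weights and preprocessed marker values, for illustration only.
weights, bias = [1.2, 0.8], -1.0
patients = [[2.0, 1.5], [0.1, 0.2], [1.0, 0.3]]
outcomes = [1, 0, 0]

probs = [logistic_probability(weights, bias, x) for x in patients]
print(accuracy(probs, outcomes), brier_score(probs, outcomes))
```

Accuracy reflects discriminatory ability at a single threshold, whereas the Brier score penalises probabilities that are far from the observed outcome, which is closer to the reliability the abstract argues for.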
Abstract:
Breast cancer remains a frequent cause of female cancer death despite the great strides in elucidation of biological subtypes and their reported clinical and prognostic significance. We have defined a general cohort of breast cancers in terms of putative actionable targets, involving growth and proliferative factors, the cell cycle, and apoptotic pathways, both as single biomarkers across a general cohort and within intrinsic molecular subtypes.
We identified 293 patients treated with adjuvant chemotherapy. Additional hormonal therapy and trastuzumab were administered depending on hormonal and HER2 status respectively. We performed immunohistochemistry for ER, PR, HER2, MM1, CK5/6, p53, TOP2A, EGFR, IGF1R, PTEN, p-mTOR and e-cadherin. The cohort was classified into luminal (62%) and non-luminal (38%) tumors as well as luminal A (27%), luminal B HER2 negative (22%) and positive (12%), HER2 enriched (14%) and triple negative (25%). Patients with luminal tumors and co-overexpression of TOP2A or IGF1R loss displayed worse overall survival (p=0.0251 and p=0.0008 respectively). Non-luminal tumors had much more heterogeneous expression profiles with no individual markers of prognostic significance. Non-luminal tumors were characterised by EGFR and TOP2A overexpression, IGF1R, PTEN and p-mTOR negativity and extreme p53 expression.
Our results indicate that only a minority of intrinsic subtype tumors purely express single novel actionable targets. This lack of pure biomarker expression is particularly prevalent in the triple negative subgroup and may allude to the mechanism of targeted therapy inaction and myriad disappointing trial results. Utilising a combinatorial biomarker approach may enhance studies of targeted therapies, providing additional information during design and patient selection while also helping decipher negative trial results.
Abstract:
We propose a recursive method of pricing an information good in a network of holders and demanders of this good. The prices are determined via a unique equilibrium outcome in a sequence of bilateral bargaining games that are played by connected agents. If the information is a homogeneous, non-depreciating good without network effects, we derive explicit formulae which elucidate the role of the link pattern among the players. In particular, we find that the equilibrium price is intimately related to the existence of cycles in the network: it is zero if a cycle covers the trading pair, and it is proportional to the direct and indirect utility that the good generates otherwise.
Abstract:
Previous studies have revealed considerable interobserver and intraobserver variation in the histological classification of preinvasive cervical squamous lesions. The aim of the present study was to develop a decision support system (DSS) for the histological interpretation of these lesions. Knowledge and uncertainty were represented in the form of a Bayesian belief network that permitted the storage of diagnostic knowledge and, for a given case, the collection of evidence in a cumulative manner that provided a final probability for the possible diagnostic outcomes. The network comprised 8 diagnostic histological features (evidence nodes) that were each independently linked to the diagnosis (decision node) by a conditional probability matrix. Diagnostic outcomes comprised normal; koilocytosis; and cervical intraepithelial neoplasia (CIN) I, CIN II, and CIN III. For each evidence feature, a set of images was recorded that represented the full spectrum of change for that feature. The system was designed to be interactive in that the histopathologist was prompted to enter evidence into the network via a specifically designed graphical user interface (i-Path Diagnostics, Belfast, Northern Ireland). Membership functions were used to derive the relative likelihoods for the alternative feature outcomes, the likelihood vector was entered into the network, and the updated diagnostic belief was computed for the diagnostic outcomes and displayed. A cumulative probability graph was generated throughout the diagnostic process and presented on screen. The network was tested on 50 cervical colposcopic biopsy specimens, comprising 10 cases each of normal, koilocytosis, CIN I, CIN II, and CIN III. These had been preselected by a consultant gynecological pathologist. Using conventional morphological assessment, the cases were classified on 2 separate occasions by 2 consultant and 2 junior pathologists.
The cases were then also classified using the DSS on 2 occasions by the 4 pathologists and by 2 medical students with no experience in cervical histology. Interobserver and intraobserver agreement using morphology and using the DSS was calculated with kappa statistics. Intraobserver reproducibility using conventional unaided diagnosis was reasonably good (kappa range, 0.688 to 0.861), but interobserver agreement was poor (kappa range, 0.347 to 0.747). Using the DSS improved overall reproducibility between individuals. It did not, however, enhance the diagnostic performance of junior pathologists when their DSS-based diagnoses were compared against those of an experienced consultant. Nevertheless, the generation of a cumulative probability graph allowed a comparison of individual performance, of how individual features were assessed in the same case, and of how this contributed to diagnostic disagreement between individuals. Diagnostic features such as nuclear pleomorphism were shown to be particularly problematic and poorly reproducible. DSSs such as this therefore have a role to play not only in enhancing decision making but also in the study of diagnostic protocol, education, self-assessment, and quality control. (C) 2003 Elsevier Inc. All rights reserved.
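The cumulative evidence update described in this abstract can be sketched as follows, assuming each feature is conditionally independent given the diagnosis and that a likelihood vector over feature states is entered as soft evidence. The conditional probability matrices, the three-state features, and the likelihood vectors are all hypothetical, as are the function names; the actual network had 8 features whose matrices are not given.

```python
DIAGNOSES = ["normal", "koilocytosis", "CIN I", "CIN II", "CIN III"]

def update_belief(belief, cpt, likelihood):
    """One evidence step.

    cpt[d][s]     = P(feature state s | diagnosis d)
    likelihood[s] = relative likelihood of state s from the observer
    The new belief is proportional to belief[d] * sum_s likelihood[s] * cpt[d][s].
    """
    unnormalised = [
        b * sum(l * p for l, p in zip(likelihood, row))
        for b, row in zip(belief, cpt)
    ]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def cumulative_trace(prior, evidence):
    """Belief after each feature, i.e. the data behind a cumulative graph."""
    trace = [prior]
    for cpt, likelihood in evidence:
        trace.append(update_belief(trace[-1], cpt, likelihood))
    return trace

# Hypothetical matrices: rows follow DIAGNOSES, columns are
# feature states (absent, moderate, marked).
pleomorphism = [
    [0.90, 0.08, 0.02],
    [0.70, 0.25, 0.05],
    [0.40, 0.50, 0.10],
    [0.15, 0.55, 0.30],
    [0.05, 0.25, 0.70],
]
mitoses = [
    [0.85, 0.13, 0.02],
    [0.80, 0.17, 0.03],
    [0.50, 0.40, 0.10],
    [0.20, 0.50, 0.30],
    [0.05, 0.35, 0.60],
]

prior = [0.2] * 5
evidence = [
    (pleomorphism, [0.0, 0.3, 1.0]),  # observer leans towards "marked"
    (mitoses, [0.1, 1.0, 0.6]),       # observer leans towards "moderate"
]
for step in cumulative_trace(prior, evidence):
    print([round(p, 3) for p in step])
```

Keeping the belief after every step, rather than only the final posterior, is what makes it possible to see which feature moved two observers apart on the same case.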
Abstract:
Before the emergence of coordination of production by firms, manufacturers and merchants traded in markets with asymmetric information. Evidence suggests that the practical knowledge thus gained by these agents was well in advance of contemporary political economists and anticipates twentieth-century developments in the economics of information. Charles Babbage, who regarded merchants and manufacturers as the chief sources of reliable economic data, drew on this knowledge as revealed in the evidence of manufacturers and merchants presented to House of Commons select committees to make an important pioneering contribution to the theory of production and exchange with information asymmetries.