847 resultados para classification methods
Resumo:
Background: Tuberculosis (TB) remains a public health issue worldwide. The lack of specific clinical symptoms to diagnose TB makes the correct decision to admit patients to respiratory isolation a difficult task for the clinician. Isolation of patients without the disease is common and increases health costs. Decision models for the diagnosis of TB in patients attending hospitals can increase the quality of care and decrease costs, without the risk of hospital transmission. We present a predictive model for predicting pulmonary TB in hospitalized patients in a high prevalence area in order to contribute to a more rational use of isolation rooms without increasing the risk of transmission. Methods: Cross sectional study of patients admitted to CFFH from March 2003 to December 2004. A classification and regression tree (CART) model was generated and validated. The area under the ROC curve (AUC), sensitivity, specificity, positive and negative predictive values were used to evaluate the performance of model. Validation of the model was performed with a different sample of patients admitted to the same hospital from January to December 2005. Results: We studied 290 patients admitted with clinical suspicion of TB. Diagnosis was confirmed in 26.5% of them. Pulmonary TB was present in 83.7% of the patients with TB (62.3% with positive sputum smear) and HIV/AIDS was present in 56.9% of patients. The validated CART model showed sensitivity, specificity, positive predictive value and negative predictive value of 60.00%, 76.16%, 33.33%, and 90.55%, respectively. The AUC was 79.70%. Conclusions: The CART model developed for these hospitalized patients with clinical suspicion of TB had fair to good predictive performance for pulmonary TB. The most important variable for prediction of TB diagnosis was chest radiograph results. Prospective validation is still necessary, but our model offer an alternative for decision making in whether to isolate patients with clinical suspicion of TB in tertiary health facilities in countries with limited resources.
Resumo:
Traditional supervised data classification considers only physical features (e. g., distance or similarity) of the input data. Here, this type of learning is called low level classification. On the other hand, the human (animal) brain performs both low and high orders of learning and it has facility in identifying patterns according to the semantic meaning of the input data. Data classification that considers not only physical attributes but also the pattern formation is, here, referred to as high level classification. In this paper, we propose a hybrid classification technique that combines both types of learning. The low level term can be implemented by any classification technique, while the high level term is realized by the extraction of features of the underlying network constructed from the input data. Thus, the former classifies the test instances by their physical features or class topologies, while the latter measures the compliance of the test instances to the pattern formation of the data. Our study shows that the proposed technique not only can realize classification according to the pattern formation, but also is able to improve the performance of traditional classification techniques. Furthermore, as the class configuration's complexity increases, such as the mixture among different classes, a larger portion of the high level term is required to get correct classification. This feature confirms that the high level classification has a special importance in complex situations of classification. Finally, we show how the proposed technique can be employed in a real-world application, where it is capable of identifying variations and distortions of handwritten digit images. As a result, it supplies an improvement in the overall pattern recognition rate.
Resumo:
Objective: (1) To investigate the incidence of laryngeal involvement in a large series of patients with pemphigus vulgaris, using endoscopic examination, (2) to describe the lesions, and (3) to establish a classification of laryngeal involvement in pemphigus vulgaris based on the location of the lesions. Study design: Prospective study. Methods: A total of 40 sequentially treated pemphigus vulgaris patients, diagnosed using clinical, histological and immunofluorescence criteria, were evaluated for laryngeal manifestations using endoscopic examination. The results were used to establish a graded classification of laryngeal involvement according to the location of the lesions. Results: Active laryngeal lesions (ulcers or blisters) were found in 16 patients (40 per cent). Of these, 37.5 per cent were classified as grade I, 20 per cent as grade II, 20 per cent as grade III and 17.5 per cent as grade IV. Conclusion: Laryngeal involvement is common in pemphigus vulgaris and must be considered at the point of diagnosis. Grade I lesions are the most frequent.
Resumo:
In multi-label classification, examples can be associated with multiple labels simultaneously. The task of learning from multi-label data can be addressed by methods that transform the multi-label classification problem into several single-label classification problems. The binary relevance approach is one of these methods, where the multi-label learning task is decomposed into several independent binary classification problems, one for each label in the set of labels, and the final labels for each example are determined by aggregating the predictions from all binary classifiers. However, this approach fails to consider any dependency among the labels. Aiming to accurately predict label combinations, in this paper we propose a simple approach that enables the binary classifiers to discover existing label dependency by themselves. An experimental study using decision trees, a kernel method as well as Naive Bayes as base-learning techniques shows the potential of the proposed approach to improve the multi-label classification performance.
Resumo:
Abstract Background Spotted cDNA microarrays generally employ co-hybridization of fluorescently-labeled RNA targets to produce gene expression ratios for subsequent analysis. Direct comparison of two RNA samples in the same microarray provides the highest level of accuracy; however, due to the number of combinatorial pair-wise comparisons, the direct method is impractical for studies including large number of individual samples (e.g., tumor classification studies). For such studies, indirect comparisons using a common reference standard have been the preferred method. Here we evaluated the precision and accuracy of reconstructed ratios from three indirect methods relative to ratios obtained from direct hybridizations, herein considered as the gold-standard. Results We performed hybridizations using a fixed amount of Cy3-labeled reference oligonucleotide (RefOligo) against distinct Cy5-labeled targets from prostate, breast and kidney tumor samples. Reconstructed ratios between all tissue pairs were derived from ratios between each tissue sample and RefOligo. Reconstructed ratios were compared to (i) ratios obtained in parallel from direct pair-wise hybridizations of tissue samples, and to (ii) reconstructed ratios derived from hybridization of each tissue against a reference RNA pool (RefPool). To evaluate the effect of the external references, reconstructed ratios were also calculated directly from intensity values of single-channel (One-Color) measurements derived from tissue sample data collected in the RefOligo experiments. We show that the average coefficient of variation of ratios between intra- and inter-slide replicates derived from RefOligo, RefPool and One-Color were similar and 2 to 4-fold higher than ratios obtained in direct hybridizations. Correlation coefficients calculated for all three tissue comparisons were also similar. In addition, the performance of all indirect methods in terms of their robustness to identify genes deemed as differentially expressed based on direct hybridizations, as well as false-positive and false-negative rates, were found to be comparable. Conclusion RefOligo produces ratios as precise and accurate as ratios reconstructed from a RNA pool, thus representing a reliable alternative in reference-based hybridization experiments. In addition, One-Color measurements alone can reconstruct expression ratios without loss in precision or accuracy. We conclude that both methods are adequate options in large-scale projects where the amount of a common reference RNA pool is usually restrictive.
Resumo:
Abstract Background Smear negative pulmonary tuberculosis (SNPT) accounts for 30% of pulmonary tuberculosis cases reported yearly in Brazil. This study aimed to develop a prediction model for SNPT for outpatients in areas with scarce resources. Methods The study enrolled 551 patients with clinical-radiological suspicion of SNPT, in Rio de Janeiro, Brazil. The original data was divided into two equivalent samples for generation and validation of the prediction models. Symptoms, physical signs and chest X-rays were used for constructing logistic regression and classification and regression tree models. From the logistic regression, we generated a clinical and radiological prediction score. The area under the receiver operator characteristic curve, sensitivity, and specificity were used to evaluate the model's performance in both generation and validation samples. Results It was possible to generate predictive models for SNPT with sensitivity ranging from 64% to 71% and specificity ranging from 58% to 76%. Conclusion The results suggest that those models might be useful as screening tools for estimating the risk of SNPT, optimizing the utilization of more expensive tests, and avoiding costs of unnecessary anti-tuberculosis treatment. Those models might be cost-effective tools in a health care network with hierarchical distribution of scarce resources.
Resumo:
OBJECTIVE: This study proposes a new approach that considers uncertainty in predicting and quantifying the presence and severity of diabetic peripheral neuropathy. METHODS: A rule-based fuzzy expert system was designed by four experts in diabetic neuropathy. The model variables were used to classify neuropathy in diabetic patients, defining it as mild, moderate, or severe. System performance was evaluated by means of the Kappa agreement measure, comparing the results of the model with those generated by the experts in an assessment of 50 patients. Accuracy was evaluated by an ROC curve analysis obtained based on 50 other cases; the results of those clinical assessments were considered to be the gold standard. RESULTS: According to the Kappa analysis, the model was in moderate agreement with expert opinions. The ROC analysis (evaluation of accuracy) determined an area under the curve equal to 0.91, demonstrating very good consistency in classifying patients with diabetic neuropathy. CONCLUSION: The model efficiently classified diabetic patients with different degrees of neuropathy severity. In addition, the model provides a way to quantify diabetic neuropathy severity and allows a more accurate patient condition assessment.
Resumo:
Hierarchical multi-label classification is a complex classification task where the classes involved in the problem are hierarchically structured and each example may simultaneously belong to more than one class in each hierarchical level. In this paper, we extend our previous works, where we investigated a new local-based classification method that incrementally trains a multi-layer perceptron for each level of the classification hierarchy. Predictions made by a neural network in a given level are used as inputs to the neural network responsible for the prediction in the next level. We compare the proposed method with one state-of-the-art decision-tree induction method and two decision-tree induction methods, using several hierarchical multi-label classification datasets. We perform a thorough experimental analysis, showing that our method obtains competitive results to a robust global method regarding both precision and recall evaluation measures.
Resumo:
This work proposes a system for classification of industrial steel pieces by means of magnetic nondestructive device. The proposed classification system presents two main stages, online system stage and off-line system stage. In online stage, the system classifies inputs and saves misclassification information in order to perform posterior analyses. In the off-line optimization stage, the topology of a Probabilistic Neural Network is optimized by a Feature Selection algorithm combined with the Probabilistic Neural Network to increase the classification rate. The proposed Feature Selection algorithm searches for the signal spectrogram by combining three basic elements: a Sequential Forward Selection algorithm, a Feature Cluster Grow algorithm with classification rate gradient analysis and a Sequential Backward Selection. Also, a trash-data recycling algorithm is proposed to obtain the optimal feedback samples selected from the misclassified ones.
Resumo:
In this paper,we present a novel texture analysis method based on deterministic partially self-avoiding walks and fractal dimension theory. After finding the attractors of the image (set of pixels) using deterministic partially self-avoiding walks, they are dilated in direction to the whole image by adding pixels according to their relevance. The relevance of each pixel is calculated as the shortest path between the pixel and the pixels that belongs to the attractors. The proposed texture analysis method is demonstrated to outperform popular and state-of-the-art methods (e.g. Fourier descriptors, occurrence matrix, Gabor filter and local binary patterns) as well as deterministic tourist walk method and recent fractal methods using well-known texture image datasets.
Resumo:
The strength and durability of materials produced from aggregates (e.g., concrete bricks, concrete, and ballast) are critically affected by the weathering of the particles, which is closely related to their mineral composition. It is possible to infer the degree of weathering from visual features derived from the surface of the aggregates. By using sound pattern recognition methods, this study shows that the characterization of the visual texture of particles, performed by using texture-related features of gray scale images, allows the effective differentiation between weathered and nonweathered aggregates. The selection of the most discriminative features is also performed by taking into account a feature ranking method. The evaluation of the methodology in the presence of noise suggests that it can be used in stone quarries for automatic detection of weathered materials.
Resumo:
In this thesis some multivariate spectroscopic methods for the analysis of solutions are proposed. Spectroscopy and multivariate data analysis form a powerful combination for obtaining both quantitative and qualitative information and it is shown how spectroscopic techniques in combination with chemometric data evaluation can be used to obtain rapid, simple and efficient analytical methods. These spectroscopic methods consisting of spectroscopic analysis, a high level of automation and chemometric data evaluation can lead to analytical methods with a high analytical capacity, and for these methods, the term high-capacity analysis (HCA) is suggested. It is further shown how chemometric evaluation of the multivariate data in chromatographic analyses decreases the need for baseline separation. The thesis is based on six papers and the chemometric tools used are experimental design, principal component analysis (PCA), soft independent modelling of class analogy (SIMCA), partial least squares regression (PLS) and parallel factor analysis (PARAFAC). The analytical techniques utilised are scanning ultraviolet-visible (UV-Vis) spectroscopy, diode array detection (DAD) used in non-column chromatographic diode array UV spectroscopy, high-performance liquid chromatography with diode array detection (HPLC-DAD) and fluorescence spectroscopy. The methods proposed are exemplified in the analysis of pharmaceutical solutions and serum proteins. In Paper I a method is proposed for the determination of the content and identity of the active compound in pharmaceutical solutions by means of UV-Vis spectroscopy, orthogonal signal correction and multivariate calibration with PLS and SIMCA classification. Paper II proposes a new method for the rapid determination of pharmaceutical solutions by the use of non-column chromatographic diode array UV spectroscopy, i.e. a conventional HPLC-DAD system without any chromatographic column connected. In Paper III an investigation is made of the ability of a control sample, of known content and identity to diagnose and correct errors in multivariate predictions something that together with use of multivariate residuals can make it possible to use the same calibration model over time. In Paper IV a method is proposed for simultaneous determination of serum proteins with fluorescence spectroscopy and multivariate calibration. Paper V proposes a method for the determination of chromatographic peak purity by means of PCA of HPLC-DAD data. In Paper VI PARAFAC is applied for the decomposition of DAD data of some partially separated peaks into the pure chromatographic, spectral and concentration profiles.
Resumo:
This thesis is based on five papers addressing variance reduction in different ways. The papers have in common that they all present new numerical methods. Paper I investigates quantitative structure-retention relationships from an image processing perspective, using an artificial neural network to preprocess three-dimensional structural descriptions of the studied steroid molecules. Paper II presents a new method for computing free energies. Free energy is the quantity that determines chemical equilibria and partition coefficients. The proposed method may be used for estimating, e.g., chromatographic retention without performing experiments. Two papers (III and IV) deal with correcting deviations from bilinearity by so-called peak alignment. Bilinearity is a theoretical assumption about the distribution of instrumental data that is often violated by measured data. Deviations from bilinearity lead to increased variance, both in the data and in inferences from the data, unless invariance to the deviations is built into the model, e.g., by the use of the method proposed in paper III and extended in paper IV. Paper V addresses a generic problem in classification; namely, how to measure the goodness of different data representations, so that the best classifier may be constructed. Variance reduction is one of the pillars on which analytical chemistry rests. This thesis considers two aspects on variance reduction: before and after experiments are performed. Before experimenting, theoretical predictions of experimental outcomes may be used to direct which experiments to perform, and how to perform them (papers I and II). After experiments are performed, the variance of inferences from the measured data are affected by the method of data analysis (papers III-V).
Resumo:
Statistical modelling and statistical learning theory are two powerful analytical frameworks for analyzing signals and developing efficient processing and classification algorithms. In this thesis, these frameworks are applied for modelling and processing biomedical signals in two different contexts: ultrasound medical imaging systems and primate neural activity analysis and modelling. In the context of ultrasound medical imaging, two main applications are explored: deconvolution of signals measured from a ultrasonic transducer and automatic image segmentation and classification of prostate ultrasound scans. In the former application a stochastic model of the radio frequency signal measured from a ultrasonic transducer is derived. This model is then employed for developing in a statistical framework a regularized deconvolution procedure, for enhancing signal resolution. In the latter application, different statistical models are used to characterize images of prostate tissues, extracting different features. These features are then uses to segment the images in region of interests by means of an automatic procedure based on a statistical model of the extracted features. Finally, machine learning techniques are used for automatic classification of the different region of interests. In the context of neural activity signals, an example of bio-inspired dynamical network was developed to help in studies of motor-related processes in the brain of primate monkeys. The presented model aims to mimic the abstract functionality of a cell population in 7a parietal region of primate monkeys, during the execution of learned behavioural tasks.
Resumo:
Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernel for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too few information for making correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernel, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in learning and classification phases. Such a complexity can sometimes prevents the kernel application in scenarios involving large amount of data. This thesis proposes three contributions for resolving the above issues of kernel for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. Convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted at reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Direct Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.