905 results for Classification Methods
Abstract:
This thesis documents the design, implementation and testing of a smart sensing platform that is able to discriminate between differences or small changes in a person's walking. The distributive tactile sensing method is used to monitor the deflection of the platform surface using just a small number of sensors and, through the use of neural networks, infer the characteristics of the object in contact with the surface. The thesis first describes the development of a mathematical model which uses a novel method to track the position of a moving load as it passes over the smart sensing surface. Experimental methods are then described for using the platform to track the position of a swinging pendulum in three dimensions. It is demonstrated that the method can be extended to real-time measurement of the balance and sway of a person during quiet standing. Current classification methods are then investigated for use in the classification of different gait patterns, in particular to identify individuals by their unique gait pattern. Based on these observations, a novel algorithm is developed that is able to discriminate between normal and affected gait. This algorithm, using the distributive tactile sensing method, was found to have greater accuracy than the other methods investigated and was designed to cope with any type of gait variation. The system developed in this thesis has applications in the area of medical diagnostics, either as an initial screening tool for detecting walking disorders or as a means of automatically detecting changes in gait over time. The system could also be used as a discreet biometric identification method, for example identifying office workers as they pass over the surface.
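To make the distributive tactile sensing idea concrete, the sketch below trains a small neural network to recover a point load's position from a handful of simulated deflection readings. The one-dimensional surface, the Gaussian-decay deflection model, the sensor layout and the network size are all illustrative assumptions, not the platform described in the thesis.

```python
# Sketch of distributive tactile sensing: infer a point load's position on a
# surface from a few deflection sensors via a learned inverse map.
# The deflection model, sensor layout and network size are toy assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
sensor_pos = np.array([0.2, 0.4, 0.6, 0.8])   # 4 sensors along a 1 m surface

def deflections(load_pos):
    """Toy model: each sensor's deflection decays with distance to the load."""
    return np.exp(-((sensor_pos - load_pos) ** 2) / 0.02)

# Training data: random load positions and the resulting (noisy) sensor readings.
x_train = rng.uniform(0.0, 1.0, 2000)
X_train = np.array([deflections(x) for x in x_train])
X_train += rng.normal(0, 0.01, X_train.shape)  # sensor noise

net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X_train, x_train)

# Track a "moving load" crossing the surface.
path = np.linspace(0.1, 0.9, 9)
est = net.predict(np.array([deflections(x) for x in path]))
print(np.round(est - path, 3))  # position error at each step
```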
Abstract:
Sentiment analysis is concerned with automatically identifying the sentiment or opinion expressed in a given piece of text. Most prior work either uses prior lexical knowledge, defined as the sentiment polarity of words, or views the task as a text classification problem and relies on labeled corpora to train a sentiment classifier. While lexicon-based approaches do not adapt well to different domains, corpus-based approaches require expensive manual annotation effort. In this paper, we propose a novel framework where an initial classifier is learned by incorporating prior information extracted from an existing sentiment lexicon, with preferences on the expected sentiment labels of those lexicon words expressed using generalized expectation criteria. Documents classified with high confidence are then used as pseudo-labeled examples for automatic domain-specific feature acquisition. The word-class distributions of such self-learned features are estimated from the pseudo-labeled examples and are used to train another classifier by constraining the model's predictions on unlabeled instances. Experiments on both the movie-review data and the multi-domain sentiment dataset show that our approach attains performance comparable to or better than existing weakly-supervised sentiment classification methods, despite using no labeled documents.
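A much-simplified sketch of this pipeline is given below: lexicon hits provide weak initial labels, a first classifier is trained on them, and its high-confidence predictions become pseudo-labels for retraining. This is plain self-training on a toy lexicon and corpus; it stands in for, but does not implement, the paper's generalized expectation criteria.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy sentiment lexicon and unlabeled corpus (assumptions for illustration).
lexicon = {"good": 1, "great": 1, "bad": 0, "awful": 0}
docs = ["good plot and great acting", "awful pacing and bad script",
        "great soundtrack", "bad ending", "the plot twist was great",
        "the script felt awful"]

def weak_label(doc):
    """Majority vote of lexicon polarities over the words in a document."""
    votes = [lexicon[w] for w in doc.split() if w in lexicon]
    return round(np.mean(votes)) if votes else None

# Step 1: initial classifier from lexicon-derived weak labels.
seed = [(d, y) for d in docs if (y := weak_label(d)) is not None]
vec = CountVectorizer()
X_seed = vec.fit_transform([d for d, _ in seed])
clf = LogisticRegression().fit(X_seed, [y for _, y in seed])

# Step 2: one self-training round; keep only high-confidence predictions
# as pseudo-labels, then retrain on them.
X_all = vec.transform(docs)
proba = clf.predict_proba(X_all)
mask = proba.max(axis=1) > 0.55          # confidence threshold (assumption)
clf = LogisticRegression().fit(X_all[mask], proba.argmax(axis=1)[mask])
print(clf.predict(vec.transform(["a great twist"])))
```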
Abstract:
Objective: Recently, much research has applied nature-inspired algorithms to complex machine learning tasks. Ant colony optimization (ACO) is one such algorithm; it is based on swarm intelligence and derived from a model inspired by the collective foraging behavior of ants. Taking advantage of ACO traits such as self-organization and robustness, this paper investigates ant-based algorithms for gene expression data clustering and associative classification. Methods and material: An ant-based clustering algorithm (Ant-C) and an ant-based association rule mining algorithm (Ant-ARM) are proposed for gene expression data analysis. The proposed algorithms make use of the natural behavior of ants, such as cooperation and adaptation, to allow a flexible, robust search for good candidate solutions. Results: Ant-C has been tested on three datasets selected from the Stanford Genomic Resource Database and achieved relatively high accuracy compared to other classical clustering methods. Ant-ARM has been tested on the acute lymphoblastic leukemia (ALL)/acute myeloid leukemia (AML) dataset and generated about 30 classification rules with high accuracy. Conclusions: Ant-C can generate an optimal number of clusters without incorporating any other algorithm such as K-means or agglomerative hierarchical clustering. For associative classification, while a few of the well-known algorithms such as Apriori, FP-growth and Magnum Opus are unable to mine any association rules from the ALL/AML dataset within a reasonable period of time, Ant-ARM is able to extract associative classification rules.
Abstract:
Quantitative structure-activity relationship (QSAR) analysis is a cornerstone of modern informatics. Predictive computational models of peptide-major histocompatibility complex (MHC) binding affinity based on QSAR technology have now become important components of modern computational immunovaccinology. Historically, such approaches have been built around semiqualitative classification methods, but these are now giving way to quantitative regression methods. We review three methods. The first two, a 2D-QSAR additive partial least squares (PLS) method and a 3D-QSAR comparative molecular similarity index analysis (CoMSIA) method, can identify the sequence dependence of peptide-binding specificity for various class I MHC alleles from the reported binding affinities (IC50) of peptide sets. The third method is an iterative self-consistent (ISC) PLS-based additive method, a recently developed extension to the additive method for the affinity prediction of class II peptides. The QSAR methods presented here have established themselves as immunoinformatic techniques complementary to existing methodology, useful in the quantitative prediction of binding affinity: current methods for the in silico identification of T-cell epitopes (which form the basis of many vaccines, diagnostics, and reagents) rely on the accurate computational prediction of peptide-MHC affinity. We have reviewed various human and mouse class I and class II allele models. Studied alleles comprise HLA-A*0101, HLA-A*0201, HLA-A*0202, HLA-A*0203, HLA-A*0206, HLA-A*0301, HLA-A*1101, HLA-A*3101, HLA-A*6801, HLA-A*6802, HLA-B*3501, H2-K(k), H2-K(b), H2-D(b), HLA-DRB1*0101, HLA-DRB1*0401, HLA-DRB1*0701, I-A(b), I-A(d), I-A(k), I-A(S), I-E(d), and I-E(k). In this chapter we provide a step-by-step guide to making such predictions; in terms of reliability, the resulting models represent an advance on existing methods. The peptides used in this study are available from the AntiJen database (http://www.jenner.ac.uk/AntiJen). The PLS method is available commercially in the SYBYL molecular modeling software package. The resulting models, which can be used for accurate T-cell epitope prediction, are freely available online at http://www.jenner.ac.uk/MHCPred.
Abstract:
Quantitative structure–activity relationship (QSAR) analysis is a cornerstone of modern informatics disciplines. Predictive computational models of peptide–major histocompatibility complex (MHC) binding affinity, based on QSAR technology, have now become a vital component of modern-day computational immunovaccinology. Historically, such approaches have been built around semi-qualitative classification methods, but these are now giving way to quantitative regression methods. The additive method, an established immunoinformatics technique for the quantitative prediction of peptide–protein affinity, was used here to identify the sequence dependence of peptide binding specificity for three mouse class I MHC alleles: H2–Db, H2–Kb and H2–Kk. As we show, in terms of reliability the resulting models represent a significant advance on existing methods. They can be used for the accurate prediction of T-cell epitopes and are freely available online (http://www.jenner.ac.uk/MHCPred).
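The additive method described above is, at its core, a linear model: the predicted log-affinity is a sum of contributions from the amino acid at each peptide position, fitted by partial least squares. The sketch below shows that structure on invented 9-mers and pIC50 values; the actual models were trained on curated AntiJen data.

```python
# Sketch of the additive QSAR method: log-affinity modelled as a sum of
# independent per-position amino-acid contributions, i.e. a linear model on a
# position-wise one-hot encoding, fitted with PLS regression.
# Peptides and pIC50 values below are invented for illustration.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

AA = "ACDEFGHIKLMNPQRSTVWY"

def encode(peptide):
    """One-hot encode each of the 9 positions (9 x 20 binary features)."""
    x = np.zeros(len(peptide) * len(AA))
    for i, aa in enumerate(peptide):
        x[i * len(AA) + AA.index(aa)] = 1.0
    return x

peptides = ["SIINFEKLV", "ASNENMETM", "SIYRYYGLV", "ASNENFEKM"]  # toy 9-mers
pic50 = np.array([7.5, 6.1, 7.9, 6.4])                          # toy affinities

X = np.array([encode(p) for p in peptides])
pls = PLSRegression(n_components=2).fit(X, pic50)
print(pls.predict(encode("SIINFEKLM").reshape(1, -1)))  # predicted pIC50
```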
Abstract:
Social media has become an effective channel for communicating both trends and public opinion on current events. However, automatically classifying social media content by topic poses various challenges. Topic classification is a common technique for capturing the themes that emerge from social media streams, but such techniques are sensitive to the evolution of topics when new event-dependent vocabularies start to emerge (e.g., Crimea becoming relevant to War Conflict during the Ukraine crisis in 2014). Therefore, traditional supervised classification methods, which rely on labelled data, can rapidly become outdated. In this paper we propose a novel transfer learning approach to address the classification of new data when the only available labelled data belong to a previous epoch. This approach relies on the incorporation of knowledge from DBpedia graphs. Our findings show promising results in understanding how features age, and how semantic features can support the evolution of topic classifiers.
Abstract:
Resource discovery is one of the key services in digitised cultural heritage collections. It requires intelligent mining of heterogeneous digital content as well as large-scale performance, which explains the recent advances in classification methods. Associative classifiers are convenient data mining tools for the field of cultural heritage because they can take into account specific combinations of attribute values. Usually, associative classifiers prioritize support over confidence. The proposed classifier, PGN, questions this common approach and focuses on confidence first, retaining only 100%-confidence rules. Classification tasks in the field of cultural heritage usually deal with datasets with many class labels, a variety caused by the richness of the culture accumulated over the centuries. Comparisons of PGN with other classifiers, such as OneR, JRip and J48, show the competitiveness of PGN in recognizing multi-class datasets on collections of masterpieces from different West and East European fine art authors and movements.
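A minimal sketch of PGN's confidence-first idea follows: among candidate class-association rules, keep only those with 100% confidence, then rank by support. Rule mining is reduced here to single-attribute rules over invented art-collection records; the real classifier mines richer attribute combinations.

```python
# Confidence-first rule selection: retain only 100%-confidence rules,
# then prioritize by support. Records are toy placeholders.
from collections import Counter

records = [({"school": "Flemish", "period": "XV"}, "Van Eyck"),
           ({"school": "Flemish", "period": "XVII"}, "Rubens"),
           ({"school": "Italian", "period": "XV"}, "Botticelli"),
           ({"school": "Flemish", "period": "XVII"}, "Rubens")]

rules = []
for attr in ("school", "period"):
    for value in {feats[attr] for feats, _ in records}:
        matching = [label for feats, label in records if feats[attr] == value]
        top_label, count = Counter(matching).most_common(1)[0]
        if count / len(matching) == 1.0:   # keep 100%-confidence rules only
            rules.append((attr, value, top_label, len(matching)))

rules.sort(key=lambda r: -r[3])            # then rank by support
print(rules)   # e.g. period=XVII -> Rubens (support 2) comes first
```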
Abstract:
Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure for reducing the noise of textual data is to remove stopwords, either by using pre-compiled stopword lists or by using more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in recent years. In this paper we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining high classification performance while reducing data sparsity and substantially shrinking the feature space.
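The dynamic strategy the results favor, treating corpus singletons as stopwords, fits in a few lines; the sketch below uses invented tweets.

```python
# Dynamic stopword generation: terms occurring exactly once in the corpus
# (singletons) are treated as stopwords and dropped, shrinking the feature
# space while keeping informative terms. Tweets are toy examples.
from collections import Counter

tweets = ["love this phone", "love the camera", "battery dies fastttt",
          "this phone rocks", "camera quality rocks"]

counts = Counter(w for t in tweets for w in t.split())
singletons = {w for w, c in counts.items() if c == 1}   # dynamic stopword list

cleaned = [" ".join(w for w in t.split() if w not in singletons) for t in tweets]
print(singletons)   # noisy tokens such as 'fastttt' are removed automatically
print(cleaned)
```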
Abstract:
The microarray technology provides a high-throughput technique to study gene expression. Microarrays can help us diagnose different types of cancers, understand biological processes, assess host responses to drugs and pathogens, find markers for specific diseases, and much more. Microarray experiments generate large amounts of data. Thus, effective data processing and analysis are critical for making reliable inferences from the data. The first part of the dissertation addresses the problem of finding an optimal set of genes (biomarkers) to classify a set of samples as diseased or normal. Three statistical gene selection methods (GS, GS-NR, and GS-PCA) were developed to identify a set of genes that best differentiate between samples. A comparative study on different classification tools was performed and the best combinations of gene selection and classifiers for multi-class cancer classification were identified. For most of the benchmark cancer data sets, the gene selection method proposed in this dissertation, GS, outperformed other gene selection methods. The classifiers based on Random Forests, neural network ensembles, and K-nearest neighbor (KNN) showed consistently good performance. A striking commonality among these classifiers is that they all use a committee-based approach, suggesting that ensemble classification methods are superior. The same biological problem may be studied at different research labs and/or performed using different lab protocols or samples. In such situations, it is important to combine results from these efforts. The second part of the dissertation addresses the problem of pooling the results from different independent experiments to obtain improved results. Four statistical pooling techniques (Fisher inverse chi-square method, Logit method, Stouffer's Z transform method, and Liptak-Stouffer weighted Z-method) were investigated in this dissertation. These pooling techniques were applied to the problem of identifying cell cycle-regulated genes in two different yeast species. As a result, improved sets of cell cycle-regulated genes were identified. The last part of the dissertation explores the effectiveness of wavelet data transforms for the task of clustering. Discrete wavelet transforms, with an appropriate choice of wavelet bases, were shown to be effective in producing clusters that were biologically more meaningful.
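The pooling techniques named above have standard closed forms: Fisher's statistic is X = -2*sum(ln p_i), referred to a chi-square distribution with 2k degrees of freedom, and the (weighted) Stouffer combination is Z = sum(w_i*z_i)/sqrt(sum(w_i^2)). The sketch below implements these two families for one gene's p-values across independent experiments; the example p-values and weights are invented.

```python
# Pooling p-values from independent experiments: Fisher's inverse chi-square
# method and the Stouffer / Liptak-Stouffer weighted Z-method.
import numpy as np
from scipy.stats import chi2, norm

def fisher(pvals):
    stat = -2 * np.sum(np.log(pvals))
    return chi2.sf(stat, df=2 * len(pvals))        # combined p-value

def stouffer(pvals, weights=None):
    z = norm.isf(pvals)                            # one-sided z-scores
    w = np.ones_like(z) if weights is None else np.asarray(weights, float)
    return norm.sf(np.sum(w * z) / np.sqrt(np.sum(w ** 2)))

p = np.array([0.04, 0.10, 0.03])   # same gene, three experiments (toy values)
print(fisher(p), stouffer(p))
print(stouffer(p, weights=[30, 10, 20]))   # e.g. weighted by sample size
```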
Abstract:
Effective conservation and management of top predators requires a comprehensive understanding of their distributions and of the underlying biological and physical processes that affect these distributions. The Mid-Atlantic Bight shelf break system is a dynamic and productive region where at least 32 species of cetaceans have been recorded through various systematic and opportunistic marine mammal surveys from the 1970s through 2012. My dissertation characterizes the spatial distribution and habitat of cetaceans in the Mid-Atlantic Bight shelf break system by utilizing marine mammal line-transect survey data, synoptic multi-frequency active acoustic data, and fine-scale hydrographic data collected during the 2011 summer Atlantic Marine Assessment Program for Protected Species (AMAPPS) survey. Although studies describing cetacean habitat and distributions have been previously conducted in the Mid-Atlantic Bight, my research specifically focuses on the shelf break region to elucidate both the physical and biological processes that influence cetacean distribution patterns within this cetacean hotspot.
In Chapter One I review biologically important areas for cetaceans in the Atlantic waters of the United States. I describe the study area, the shelf break region of the Mid-Atlantic Bight, in terms of the general oceanography, productivity and biodiversity. According to recent habitat-based cetacean density models, the shelf break region is an area of high cetacean abundance and density, yet little research is directed at understanding the mechanisms that establish this region as a cetacean hotspot.
In Chapter Two I present the basic physical principles of sound in water and describe the methodology used to categorize opportunistically collected multi-frequency active acoustic data using frequency response techniques. Frequency response classification methods are usually employed in conjunction with net-tow data, but the logistics of the 2011 AMAPPS survey did not allow for appropriate net-tow data to be collected. Biologically meaningful information can be extracted from acoustic scattering regions by comparing the frequency response curves of acoustic regions to theoretical curves of known scattering models. Using the five frequencies on the EK60 system (18, 38, 70, 120, and 200 kHz), three categories of scatterers were defined: fish-like (with swim bladder), nekton-like (e.g., euphausiids), and plankton-like (e.g., copepods). I also employed a multi-frequency acoustic categorization method using three frequencies (18, 38, and 120 kHz) that has been used in the Gulf of Maine and on Georges Bank, which is based on the presence or absence of volume backscatter above a threshold. This method is more objective than the comparison of frequency response curves because it uses an established backscatter value for the threshold. By removing all data below the threshold, only strong scattering information is retained.
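The sketch below illustrates this presence/absence categorization at 18, 38 and 120 kHz: volume backscatter is thresholded per frequency and the resulting pattern is mapped to a scatterer class. The threshold value and the pattern-to-class mapping here are illustrative assumptions, not the exact scheme of Jech and Michaels (2006).

```python
# Multi-frequency categorization by thresholded presence/absence: volume
# backscatter (Sv, dB) at 18, 38 and 120 kHz is masked at a threshold, and the
# presence pattern across frequencies assigns a scatterer category.
# The -66 dB threshold and the mapping below are illustrative assumptions.

THRESHOLD_DB = -66.0
CATEGORY = {(True, True, False): "fish-like",      # strong at low frequencies
            (False, True, True): "nekton-like",
            (False, False, True): "plankton-like"} # strong only at 120 kHz

def classify(sv_18, sv_38, sv_120):
    pattern = tuple(sv > THRESHOLD_DB for sv in (sv_18, sv_38, sv_120))
    return CATEGORY.get(pattern, "unclassified")

print(classify(-60.0, -62.0, -75.0))   # -> fish-like
print(classify(-80.0, -79.0, -58.0))   # -> plankton-like
```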
In Chapter Three I analyze the distribution of the categorized acoustic regions of interest during the daytime cross-shelf transects. Over all transects, plankton-like acoustic regions of interest were detected most frequently, followed by fish-like acoustic regions and then nekton-like acoustic regions. Plankton-like detections were the only acoustic detections per kilometer to differ significantly, although nekton-like detections fell just short of significance. Using the threshold categorization method of Jech and Michaels (2006) provides a more conservative and discrete detection of acoustic scatterers and allows me to retrieve backscatter values along transects in areas that have been categorized. This provides continuous data values that can be integrated at discrete spatial increments for wavelet analysis. Wavelet analysis indicates that significant spatial scales of interest for fish-like and nekton-like acoustic backscatter range from one to four kilometers and vary among transects.
In Chapter Four I analyze the fine-scale distribution of cetaceans in the shelf break system of the Mid-Atlantic Bight using corrected sightings per trackline region, classification trees, multidimensional scaling, and random forest analysis. I describe habitat for common dolphins, Risso’s dolphins and sperm whales. From the distribution of cetacean sightings, patterns of habitat start to emerge: within the shelf break region of the Mid-Atlantic Bight, common dolphins were sighted more prevalently over the shelf, while sperm whales were more frequently found in the deep waters offshore and Risso’s dolphins were most prevalent at the shelf break. Multidimensional scaling shows clear environmental separation among common dolphins, Risso’s dolphins and sperm whales. The sperm whale random forest habitat model had the lowest misclassification error (0.30) and the Risso’s dolphin random forest habitat model had the greatest misclassification error (0.37). Shallow water depth (less than 148 meters) was the primary variable selected in the classification model for common dolphin habitat. Distance to surface density fronts and distance to surface temperature fronts were the primary variables selected in the classification models describing Risso’s dolphin habitat and sperm whale habitat, respectively. When mapped back into geographic space, these three cetacean species occupy different fine-scale habitats within the dynamic Mid-Atlantic Bight shelf break system.
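A random forest habitat model of the kind described in this chapter can be sketched as below, with out-of-bag error standing in for the reported misclassification error. The covariates and the presence rule are invented placeholders that merely echo the shallow-depth finding for common dolphins; they are not the AMAPPS data.

```python
# Sketch of a random forest habitat model: species presence classified from
# environmental covariates, with out-of-bag (OOB) error as the
# misclassification estimate. Data and presence rule are toy assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 400
depth_m = rng.uniform(20, 1000, n)
dist_front_km = rng.uniform(0, 100, n)
# Toy rule echoing the chapter's finding for common dolphins: shallow water.
present = (depth_m < 148) & (rng.random(n) > 0.2)

X = np.column_stack([depth_m, dist_front_km])
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, present)
print("OOB misclassification:", round(1 - rf.oob_score_, 3))
print("importances (depth, dist_front):", np.round(rf.feature_importances_, 2))
```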
In Chapter Five I summarize the previous chapters and propose analytical next steps for addressing ecological questions pertaining to the dynamic shelf break region. Taken together, the results of my dissertation demonstrate the use of opportunistically collected data in ecosystem studies, emphasize the need to incorporate middle-trophic-level data and oceanographic features into cetacean habitat models, and underline the importance of developing a more mechanistic understanding of dynamic ecosystems.
Abstract:
With recent advances in remote sensing processing technology, it has become more feasible to begin analysis of the enormous historical archive of remotely sensed data. This historical data provides valuable information on a wide variety of topics, which can influence the lives of millions of people if processed correctly and in a timely manner. One such field of benefit is that of landslide mapping and inventory, which provides a historical reference to those who live near high-risk areas so that future disasters may be avoided. In order to properly map landslides remotely, an optimum method must first be determined. Historically, mapping has been attempted using pixel-based methods such as unsupervised and supervised classification. These methods are limited in that they characterize an image only spectrally, based on single-pixel values; this produces results that are prone to false positives and often lack meaningful objects. Recently, several reliable methods of Object Oriented Analysis (OOA) have been developed which utilize a full range of spectral, spatial, textural, and contextual parameters to delineate regions of interest. A comparison of these two methods on a historical dataset of the landslide-affected city of San Juan La Laguna, Guatemala has demonstrated the benefits of OOA methods over those of unsupervised classification. Overall accuracies of 96.5% and 94.3% and F-scores of 84.3% and 77.9% were achieved for the OOA and unsupervised classification methods, respectively. The larger difference in F-score is a result of the low precision of unsupervised classification, caused by poor false-positive removal, the greatest shortcoming of this method.
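The divergence between accuracy and F-score reported above is easy to reproduce arithmetically: with a large true-negative background, extra false positives barely dent accuracy but collapse precision and hence the F-score. The counts below are invented for illustration, not taken from the study.

```python
# Why accuracy and F-score diverge: many true-negative background pixels mask
# false positives in accuracy, while precision (and so F-score) collapses.
def scores(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return accuracy, precision, recall, f1

print(scores(tp=400, fp=50, fn=100, tn=9450))    # few false positives
print(scores(tp=400, fp=150, fn=100, tn=9350))   # more false positives:
                                                 # accuracy barely moves,
                                                 # F-score drops sharply
```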
Abstract:
Conventional rockmass characterization and analysis methods for geotechnical assessment in mining, civil tunnelling, and other excavations consider only the intact rock properties and the discrete fractures that are present and form blocks within rockmasses. Field logging and classification protocols are based on historically useful but highly simplified design techniques, including direct empirical design and empirical strength assessment for simplified ground reaction and support analysis. As modern underground excavations go deeper and enter more highly stressed environments with complex excavation geometries and associated stress paths, healed structures within initially intact rock blocks, such as sedimentary nodule boundaries and hydrothermal veins, veinlets and stockwork (termed intrablock structure), are having an increasing influence on rockmass behaviour and should be included in modern geotechnical design. Because conventional design relies on geotechnical classification methods which predate computer-aided analysis, these complexities are ignored. Given the comparatively complex, sophisticated and powerful numerical simulation and analysis techniques now practically available to the geotechnical engineer, this research is driven by the need for enhanced characterization of intrablock structure for application to numerical methods. Intrablock structure governs stress-driven behaviour at depth and gravity-driven disintegration for large shallow spans, and controls ultimate fragmentation. This research addresses the characterization of intrablock structure and the understanding of its behaviour at laboratory testing and excavation scales, and presents new methodologies and tools to incorporate intrablock structure into geotechnical design practice. A new field characterization tool, the Composite Geological Strength Index, is used for outcrop or excavation face evaluation and provides direct input to continuum numerical models with implicit rockmass structure. A brittle overbreak estimation tool for complex rockmasses is developed using field observations. New methods to evaluate the geometrical and mechanical properties of intrablock structure are developed. Finally, laboratory direct shear testing protocols for interblock structure are critically evaluated and extended to intrablock structure for the purpose of determining input parameters for numerical models with explicit structure.
Abstract:
INTRODUCTION: Esophageal adenocarcinoma (EAC) is a severe malignancy in terms of prognosis and mortality rate. Because of its great genetic heterogeneity, disputes regarding its classification, prevention and treatment remain unresolved. AIM: We investigated intra- and inter-EAC heterogeneity by defining EAC's somatic mutational profile and the role of candidate microRNAs, in order to correlate the molecular profile of tumors with clinical outcomes and to identify biomarkers for classification. METHODS: 38 EAC cases were analyzed via high-throughput cell sorting technology combined with targeted sequencing and whole-genome low-pass sequencing. Targeted sequencing of a further 169 cases was performed to widen the study. miR221 and miR483-3p expression was profiled via qPCR in 112 EACs and its correlation with clinical outcomes was investigated. RESULTS: 35/38 EACs carried at least one somatic mutation absent in stromal cells. TP53 was found mutated in 73.7% of cases. Selective sorting revealed tumor subclones with different mutational loads and copy number alterations, confirming the high intra-tumor heterogeneity of EAC. Mutations were in most cases in the homozygous state, and we identified alterations that were missed by whole-tumor analysis. Mutations in the HNF1A gene, not previously associated with EAC, were identified in both cohorts. Higher expression of miR483-3p and miR221 was associated with poorer cancer-specific survival (P=0.0293 and P=0.0059, respectively) and with recurrence in the Lauren intestinal subtype (P=0.0459 and P=0.0002). Median expression levels of the miRNAs were higher in patients with advanced tumor stages. Loss of SMAD4 immunoreactivity was significantly associated with poorer cancer-specific survival and recurrence (P=0.0452 and P=0.022, respectively). CONCLUSION: Combining selective sorting technology and next-generation sequencing allowed us to better define EAC inter- and intra-tumor heterogeneity. We identified HNF1A as a new mutated gene associated with EAC that could be involved in tumor progression, as well as promising biomarkers, such as SMAD4, miR221 and miR483-3p, for identifying patients at higher risk of more aggressive tumors.
Abstract:
Frankfurters are widely consumed all over the world, and their production requires a wide range of meat and non-meat ingredients. Because of these characteristics, frankfurters can easily be adulterated with lower-value meats and undeclared species. Such adulterations are often difficult to detect because the adulterant components are usually very similar to the authentic product. In this work, FT-Raman spectroscopy was employed as a rapid technique for assessing the quality of frankfurters. Based on information provided by the Raman spectra, a multivariate classification model was developed to identify the frankfurter type. The aim was to classify three types of frankfurters (chicken, turkey and mixed meat) according to their Raman spectra, based on the vibrational bands of fats. The classification model was built using partial least squares discriminant analysis (PLS-DA), and model performance was evaluated in terms of sensitivity, specificity, accuracy, efficiency and the Matthews correlation coefficient. The PLS-DA models gave sensitivity and specificity values on the test set in the range of 88%-100%, showing the good performance of the classification models. This work shows that Raman spectroscopy with chemometric tools can be used as an analytical tool in the quality control of frankfurters.
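A sketch of the modelling step is given below: PLS-DA is implemented as PLS regression on class indicators with a 0.5 decision threshold, evaluated with the sensitivity, specificity and Matthews correlation coefficient used in the paper. The spectra are random placeholders (a binary chicken-vs-turkey case for simplicity); a real model would use preprocessed FT-Raman intensities.

```python
# PLS-DA sketch: PLS regression on a 0/1 class indicator, thresholded at 0.5,
# evaluated with sensitivity, specificity and MCC. Spectra are toy data.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics import confusion_matrix, matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 300))     # 60 spectra x 300 wavenumber bins (toy)
y = np.repeat([0, 1], 30)          # two classes, e.g. chicken vs turkey
X[y == 1, 40:60] += 0.8            # synthetic class-specific band

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33,
                                          stratify=y, random_state=0)
pls = PLSRegression(n_components=3).fit(X_tr, y_tr)
pred = (pls.predict(X_te).ravel() > 0.5).astype(int)   # PLS-DA decision rule

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
print("MCC:", round(matthews_corrcoef(y_te, pred), 2))
```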
Abstract:
The objective of this work was to study the distribution of values of the coefficient of variation (CV) in experiments on the papaya crop (Carica papaya L.), proposing ranges to guide researchers in evaluating the CV for different characters in the field. The data used in this study were obtained through a bibliographical review of Brazilian journals, dissertations and theses. This study considered the following characters: stalk diameter, insertion height of the first fruit, plant height, number of fruits per plant, fruit biomass, fruit length, equatorial diameter of the fruit, pulp thickness, fruit firmness, soluble solids and internal cavity diameter. For each character, ranges of CV values were obtained based on the methodologies proposed by Garcia and by Costa, and on the standard classification of Pimentel-Gomes. The results indicated that the ranges of CV values differed among the various characters, presenting large variation, which justifies the need for a specific evaluation range for each character. In addition, the use of classification ranges obtained from the methodology of Costa is recommended.
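As a sketch of the evaluation workflow, the code below computes a trial's CV and classifies it against ranges. The default ranges shown are the familiar Pimentel-Gomes ones (low below 10%, medium 10-20%, high 20-30%, very high above 30%); the character-specific ranges derived in the paper would replace them. The data are toy values.

```python
# Compute a trial's coefficient of variation and classify it against ranges.
# Default ranges follow the standard Pimentel-Gomes classification; the
# character-specific ranges proposed in the paper would be substituted here.
import numpy as np

def cv_percent(values):
    values = np.asarray(values, dtype=float)
    return 100.0 * values.std(ddof=1) / values.mean()

def classify_cv(cv, ranges=((10, "low"), (20, "medium"), (30, "high"))):
    for upper, label in ranges:
        if cv < upper:
            return label
    return "very high"

fruit_mass_g = [410.2, 455.9, 390.4, 520.7, 465.0]   # toy plot means
cv = cv_percent(fruit_mass_g)
print(round(cv, 1), classify_cv(cv))
```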