856 resultados para sequence data mining
Resumo:
This thesis takes a new data mining approach for analyzing road/crash data by developing models for the whole road network and generating a crash risk profile. Roads with an elevated crash risk due to road surface friction deficit are identified. The regression tree model, predicting road segment crash rate, is applied in a novel deployment coined regression tree extrapolation that produces a skid resistance/crash rate curve. Using extrapolation allows the method to be applied across the network and cope with the high proportion of missing road surface friction values. This risk profiling method can be applied in other domains.
Resumo:
Road surface skid resistance has been shown to have a strong relationship to road crash risk, however, applying the current method of using investigatory levels to identify crash prone roads is problematic as they may fail in identifying risky roads outside of the norm. The proposed method analyses a complex and formerly impenetrable volume of data from roads and crashes using data mining. This method rapidly identifies roads with elevated crash-rate, potentially due to skid resistance deficit, for investigation. A hypothetical skid resistance/crash risk curve is developed for each road segment, driven by the model deployed in a novel regression tree extrapolation method. The method potentially solves the problem of missing skid resistance values which occurs during network-wide crash analysis, and allows risk assessment of the major proportion of roads without skid resistance values.
Resumo:
This thesis is a study for automatic discovery of text features for describing user information needs. It presents an innovative data-mining approach that discovers useful knowledge from both relevance and non-relevance feedback information. The proposed approach can largely reduce noises in discovered patterns and significantly improve the performance of text mining systems. This study provides a promising method for the study of Data Mining and Web Intelligence.
Resumo:
This thesis describes the development of a robust and novel prototype to address the data quality problems that relate to the dimension of outlier data. It thoroughly investigates the associated problems with regards to detecting, assessing and determining the severity of the problem of outlier data; and proposes granule-mining based alternative techniques to significantly improve the effectiveness of mining and assessing outlier data.
Resumo:
This paper uses innovative content analysis techniques to map how the death of Oscar Pistorius' girlfriend, Reeva Steenkamp, was framed on Twitter conversations. Around 1.5 million posts from a two-week timeframe are analyzed with a combination of syntactic and semantic methods. This analysis is grounded in the frame analysis perspective and is different than sentiment analysis. Instead of looking for explicit evaluations, such as “he is guilty” or “he is innocent”, we showcase through the results how opinions can be identified by complex articulations of more implicit symbolic devices such as examples and metaphors repeatedly mentioned. Different frames are adopted by users as more information about the case is revealed: from a more episodic one, highly used in the very beginning, to more systemic approaches, highlighting the association of the event with urban violence, gun control issues, and violence against women. A detailed timeline of the discussions is provided.
Resumo:
Road asset managers are seeking analysis of the whole road network to supplement statistical analyses of small subsets of homogeneous roadway. This study outlines the use of data mining capable of analyzing the wide range of situations found on the network, with a focus on the role of skid resistance in the cause of crashes. Results from the analyses show that on non-crash-prone roads with low crash rates, skid resistance contributes only in a minor way, whereas on high-crash roadways, skid resistance often contributes significantly in the calculation of the crash rate. The results provide evidence supporting a causal relationship between skid resistance and crashes and highlight the importance of the role of skid resistance in decision making in road asset management.
Resumo:
This research is a step forward in improving the accuracy of detecting anomaly in a data graph representing connectivity between people in an online social network. The proposed hybrid methods are based on fuzzy machine learning techniques utilising different types of structural input features. The methods are presented within a multi-layered framework which provides the full requirements needed for finding anomalies in data graphs generated from online social networks, including data modelling and analysis, labelling, and evaluation.
Resumo:
Identifying product families has been considered as an effective way to accommodate the increasing product varieties across the diverse market niches. In this paper, we propose a novel framework to identifying product families by using a similarity measure for a common product design data BOM (Bill of Materials) based on data mining techniques such as frequent mining and clus-tering. For calculating the similarity between BOMs, a novel Extended Augmented Adjacency Matrix (EAAM) representation is introduced that consists of information not only of the content and topology but also of the fre-quent structural dependency among the various parts of a product design. These EAAM representations of BOMs are compared to calculate the similarity between products and used as a clustering input to group the product fami-lies. When applied on a real-life manufacturing data, the proposed framework outperforms a current baseline that uses orthogonal Procrustes for grouping product families.
Resumo:
Due to the availability of huge number of web services, finding an appropriate Web service according to the requirements of a service consumer is still a challenge. Moreover, sometimes a single web service is unable to fully satisfy the requirements of the service consumer. In such cases, combinations of multiple inter-related web services can be utilised. This paper proposes a method that first utilises a semantic kernel model to find related services and then models these related Web services as nodes of a graph. An all-pair shortest-path algorithm is applied to find the best compositions of Web services that are semantically related to the service consumer requirement. The recommendation of individual and composite Web services composition for a service request is finally made. Empirical evaluation confirms that the proposed method significantly improves the accuracy of service discovery in comparison to traditional keyword-based discovery methods.
Resumo:
Protein adsorption at solid-liquid interfaces is critical to many applications, including biomaterials, protein microarrays and lab-on-a-chip devices. Despite this general interest, and a large amount of research in the last half a century, protein adsorption cannot be predicted with an engineering level, design-orientated accuracy. Here we describe a Biomolecular Adsorption Database (BAD), freely available online, which archives the published protein adsorption data. Piecewise linear regression with breakpoint applied to the data in the BAD suggests that the input variables to protein adsorption, i.e., protein concentration in solution; protein descriptors derived from primary structure (number of residues, global protein hydrophobicity and range of amino acid hydrophobicity, isoelectric point); surface descriptors (contact angle); and fluid environment descriptors (pH, ionic strength), correlate well with the output variable-the protein concentration on the surface. Furthermore, neural network analysis revealed that the size of the BAD makes it sufficiently representative, with a neural network-based predictive error of 5% or less. Interestingly, a consistently better fit is obtained if the BAD is divided in two separate sub-sets representing protein adsorption on hydrophilic and hydrophobic surfaces, respectively. Based on these findings, selected entries from the BAD have been used to construct neural network-based estimation routines, which predict the amount of adsorbed protein, the thickness of the adsorbed layer and the surface tension of the protein-covered surface. While the BAD is of general interest, the prediction of the thickness and the surface tension of the protein-covered layers are of particular relevance to the design of microfluidics devices.
Resumo:
Rolling-element bearing failures are the most frequent problems in rotating machinery, which can be catastrophic and cause major downtime. Hence, providing advance failure warning and precise fault detection in such components are pivotal and cost-effective. The vast majority of past research has focused on signal processing and spectral analysis for fault diagnostics in rotating components. In this study, a data mining approach using a machine learning technique called anomaly detection (AD) is presented. This method employs classification techniques to discriminate between defect examples. Two features, kurtosis and Non-Gaussianity Score (NGS), are extracted to develop anomaly detection algorithms. The performance of the developed algorithms was examined through real data from a test to failure bearing. Finally, the application of anomaly detection is compared with one of the popular methods called Support Vector Machine (SVM) to investigate the sensitivity and accuracy of this approach and its ability to detect the anomalies in early stages.
Resumo:
The pharaoh cuttle Sepia pharaonis Ehrenberg, 1831 (Mollusca: Cephalopoda: Sepiida) is a broadly distributed species of substantial fisheries importance found from east Africa to southern Japan. Little is known about S. pharaonis phylogeography, but evidence from morphology and reproductive biology suggests that Sepia pharaonis is actually a complex of at least three species. To evaluate this possibility, we collected tissue samples from Sepia pharaonis from throughout its range. Phylogenetic analyses of partial mitochondrial 16S sequences from these samples reveal five distinct clades: a Gulf of Aden/Red Sea clade, a northern Australia clade, a Persian Gulf/Arabian Sea clade, a western Pacific clade (Gulf of Thailand and Taiwan) and an India/Andaman Sea clade. Phylogenetic analyses including several Sepia species show that S. pharaonis sensu lato may not be monophyletic. We suggest that "S. pharaonis" may consist of up to five species, but additional data will be required to fully clarify relationships within the S. pharaonis complex.
Resumo:
Graminicolous downy mildews (GDM) are an understudied, yet economically important, group of plant pathogens, which are one of the major constraints to poaceous crops in the tropics and subtropics. Here we present a first molecular phylogeny based on cox2 sequences comprising all genera of the GDM currently accepted, with both lasting (Graminivora, Poakatesthia, and Viennotia) and evanescent (Peronosclerospora, Sclerophthora, and Sclerospora) sporangiophores. In addition, all other downy mildew genera currently accepted, as well as a representative sample of other oomycete taxa, have been included. It was shown that all genera of the GDM have had a long, independent evolutionary history, and that the delineation between Peronosclerospora and Sclerospora is correct. Sclerophthora was found to be a particularly divergent taxon nested within a paraphyletic Phytophthora, but without support. The results confirm that the placement of Peronosclerospora and Sclerospora in the Saprolegniomycetidae is incorrect. Sclerophthora is not closely related to Pachymetra of the family Verrucalvaceae, and also does not belong to the Saprolegniomycetidae, but shows close affinities to the Peronosporaceae. In addition, all GDM are interspersed throughout the Peronosporaceae s lat., suggesting that a separate family for the Sclerosporaceae might not be justified.