773 resultados para Data Mining, Rough Sets, Multi-Dimension, Association Rules, Constraint
Resumo:
Keyword Spotting is the task of detecting keywords of interest within continu- ous speech. The applications of this technology range from call centre dialogue systems to covert speech surveillance devices. Keyword spotting is particularly well suited to data mining tasks such as real-time keyword monitoring and unre- stricted vocabulary audio document indexing. However, to date, many keyword spotting approaches have su®ered from poor detection rates, high false alarm rates, or slow execution times, thus reducing their commercial viability. This work investigates the application of keyword spotting to data mining tasks. The thesis makes a number of major contributions to the ¯eld of keyword spotting. The ¯rst major contribution is the development of a novel keyword veri¯cation method named Cohort Word Veri¯cation. This method combines high level lin- guistic information with cohort-based veri¯cation techniques to obtain dramatic improvements in veri¯cation performance, in particular for the problematic short duration target word class. The second major contribution is the development of a novel audio document indexing technique named Dynamic Match Lattice Spotting. This technique aug- ments lattice-based audio indexing principles with dynamic sequence matching techniques to provide robustness to erroneous lattice realisations. The resulting algorithm obtains signi¯cant improvement in detection rate over lattice-based audio document indexing while still maintaining extremely fast search speeds. The third major contribution is the study of multiple veri¯er fusion for the task of keyword veri¯cation. The reported experiments demonstrate that substantial improvements in veri¯cation performance can be obtained through the fusion of multiple keyword veri¯ers. The research focuses on combinations of speech background model based veri¯ers and cohort word veri¯ers. The ¯nal major contribution is a comprehensive study of the e®ects of limited training data for keyword spotting. This study is performed with consideration as to how these e®ects impact the immediate development and deployment of speech technologies for non-English languages.
Resumo:
In a seminal data mining article, Leo Breiman [1] argued that to develop effective predictive classification and regression models, we need to move away from the sole dependency on statistical algorithms and embrace a wider toolkit of modeling algorithms that include data mining procedures. Nevertheless, many researchers still rely solely on statistical procedures when undertaking data modeling tasks; the sole reliance on these procedures has lead to the development of irrelevant theory and questionable research conclusions ([1], p.199). We will outline initiatives that the HPC & Research Support group is undertaking to engage researchers with data mining tools and techniques; including a new range of seminars, workshops, and one-on-one consultations covering data mining algorithms, the relationship between data mining and the research cycle, and limitations and problems with these new algorithms. Organisational limitations and restrictions to these initiatives are also discussed.
Resumo:
Advances in data mining have provided techniques for automatically discovering underlying knowledge and extracting useful information from large volumes of data. Data mining offers tools for quick discovery of relationships, patterns and knowledge in large complex databases. Application of data mining to manufacturing is relatively limited mainly because of complexity of manufacturing data. Growing self organizing map (GSOM) algorithm has been proven to be an efficient algorithm to analyze unsupervised DNA data. However, it produced unsatisfactory clustering when used on some large manufacturing data. In this paper a data mining methodology has been proposed using a GSOM tool which was developed using a modified GSOM algorithm. The proposed method is used to generate clusters for good and faulty products from a manufacturing dataset. The clustering quality (CQ) measure proposed in the paper is used to evaluate the performance of the cluster maps. The paper also proposed an automatic identification of variables to find the most probable causative factor(s) that discriminate between good and faulty product by quickly examining the historical manufacturing data. The proposed method offers the manufacturers to smoothen the production flow and improve the quality of the products. Simulation results on small and large manufacturing data show the effectiveness of the proposed method.
Resumo:
Road safety is a major concern worldwide. Road safety will improve as road conditions and their effects on crashes are continually investigated. This paper proposes to use the capability of data mining to include the greater set of road variables for all available crashes with skid resistance values across the Queensland state main road network in order to understand the relationships among crash, traffic and road variables. This paper presents a data mining based methodology for the road asset management data to find out the various road properties that contribute unduly to crashes. The models demonstrate high levels of accuracy in predicting crashes in roads when various road properties are included. This paper presents the findings of these models to show the relationships among skid resistance, crashes, crash characteristics and other road characteristics such as seal type, seal age, road type, texture depth, lane count, pavement width, rutting, speed limit, traffic rates intersections, traffic signage and road design and so on.
Resumo:
It is commonly accepted that wet roads have higher risk of crash than dry roads; however, providing evidence to support this assumption presents some difficulty. This paper presents a data mining case study in which predictive data mining is applied to model the skid resistance and crash relationship to search for discernable differences in the probability of wet and dry road segments having crashes based on skid resistance. The models identify an increased probability of wet road segments having crashes for mid-range skid resistance values.
Resumo:
This thesis investigates profiling and differentiating customers through the use of statistical data mining techniques. The business application of our work centres on examining individuals’ seldomly studied yet critical consumption behaviour over an extensive time period within the context of the wireless telecommunication industry; consumption behaviour (as oppose to purchasing behaviour) is behaviour that has been performed so frequently that it become habitual and involves minimal intentions or decision making. Key variables investigated are the activity initialised timestamp and cell tower location as well as the activity type and usage quantity (e.g., voice call with duration in seconds); and the research focuses are on customers’ spatial and temporal usage behaviour. The main methodological emphasis is on the development of clustering models based on Gaussian mixture models (GMMs) which are fitted with the use of the recently developed variational Bayesian (VB) method. VB is an efficient deterministic alternative to the popular but computationally demandingMarkov chainMonte Carlo (MCMC) methods. The standard VBGMMalgorithm is extended by allowing component splitting such that it is robust to initial parameter choices and can automatically and efficiently determine the number of components. The new algorithm we propose allows more effective modelling of individuals’ highly heterogeneous and spiky spatial usage behaviour, or more generally human mobility patterns; the term spiky describes data patterns with large areas of low probability mixed with small areas of high probability. Customers are then characterised and segmented based on the fitted GMM which corresponds to how each of them uses the products/services spatially in their daily lives; this is essentially their likely lifestyle and occupational traits. Other significant research contributions include fitting GMMs using VB to circular data i.e., the temporal usage behaviour, and developing clustering algorithms suitable for high dimensional data based on the use of VB-GMM.
Resumo:
Road asset managers are overwhelmed with a high volume of raw data which they need to process and utilise in supporting their decision making. This paper presents a method that processes road-crash data of a whole road network and exposes hidden value inherent in the data by deploying the clustering data mining method. The goal of the method is to partition the road network into a set of groups (classes) based on common data and characterise the class crash types to produce a crash profiles for each cluster. By comparing similar road classes with differing crash types and rates, insight can be gained into these differences that are caused by the particular characteristics of their roads. These differences can be used as evidence in knowledge development and decision support.
Resumo:
With the increasing number of XML documents in varied domains, it has become essential to identify ways of finding interesting information from these documents. Data mining techniques were used to derive this interesting information. Mining on XML documents is impacted by its model due to the semi-structured nature of these documents. Hence, in this chapter we present an overview of the various models of XML documents, how these models were used for mining and some of the issues and challenges in these models. In addition, this chapter also provides some insights into the future models of XML documents for effectively capturing the two important features namely structure and content of XML documents for mining.
Resumo:
The rapid growth in the number of users using social networks and the information that a social network requires about their users make the traditional matching systems insufficiently adept at matching users within social networks. This paper introduces the use of clustering to form communities of users and, then, uses these communities to generate matches. Forming communities within a social network helps to reduce the number of users that the matching system needs to consider, and helps to overcome other problems from which social networks suffer, such as the absence of user activities' information about a new user. The proposed system has been evaluated on a dataset obtained from an online dating website. Empirical analysis shows that accuracy of the matching process is increased using the community information.
Resumo:
The research team recognized the value of network-level Falling Weight Deflectometer (FWD) testing to evaluate the structural condition trends of flexible pavements. However, practical limitations due to the cost of testing, traffic control and safety concerns and the ability to test a large network may discourage some agencies from conducting the network-level FWD testing. For this reason, the surrogate measure of the Structural Condition Index (SCI) is suggested for use. The main purpose of the research presented in this paper is to investigate data mining strategies and to develop a prediction method of the structural condition trends for network-level applications which does not require FWD testing. The research team first evaluated the existing and historical pavement condition, distress, ride, traffic and other data attributes in the Texas Department of Transportation (TxDOT) Pavement Maintenance Information System (PMIS), applied data mining strategies to the data, discovered useful patterns and knowledge for SCI value prediction, and finally provided a reasonable measure of pavement structural condition which is correlated to the SCI. To evaluate the performance of the developed prediction approach, a case study was conducted using the SCI data calculated from the FWD data collected on flexible pavements over a 5-year period (2005 – 09) from 354 PMIS sections representing 37 pavement sections on the Texas highway system. The preliminary study results showed that the proposed approach can be used as a supportive pavement structural index in the event when FWD deflection data is not available and help pavement managers identify the timing and appropriate treatment level of preventive maintenance activities.
Resumo:
Key decisions at the collection, pre-processing, transformation, mining and interpretation phase of any knowledge discovery from database (KDD) process depend heavily on assumptions and theorectical perspectives relating to the type of task to be performed and characteristics of data sourced. In this article, we compare and contrast theoretical perspectives and assumptions taken in data mining exercises in the legal domain with those adopted in data mining in TCM and allopathic medicine. The juxtaposition results in insights for the application of KDD for Traditional Chinese Medicine.
Resumo:
This thesis improves the process of recommending people to people in social networks using new clustering algorithms and ranking methods. The proposed system and methods are evaluated on the data collected from a real life social network. The empirical analysis of this research confirms that the proposed system and methods achieved improvements in the accuracy and efficiency of matching and recommending people, and overcome some of the problems that social matching systems usually suffer.