945 resultados para Data mining models


Relevância:

80.00% 80.00%

Publicador:

Resumo:

Driving on an approach to a signalized intersection while distracted is relatively risky, as potential vehicular conflicts and resulting angle collisions tend to be relatively more severe compared to other locations. Given the prevalence and importance of this particular scenario, the objective of this study was to examine the decisions and actions of distracted drivers during the onset of yellow lights. Driving simulator data were obtained from a sample of 69 drivers under baseline and handheld cell phone conditions at the University of Iowa – National Advanced Driving Simulator. Explanatory variables included age, gender, cell phone use, distance to stop-line, and speed. Although there is extensive research on drivers’ responses to yellow traffic signals, the examinations have been conducted from a traditional regression-based approach, which do not necessary provide the underlying relations and patterns among the sampled data. In this paper, we exploit the benefits of both classical statistical inference and data mining techniques to identify the a priori relationships among main effects, non-linearities, and interaction effects. Results suggest that the probability of yellow light running increases with the increase in driving speed at the onset of yellow. Both young (18–25 years) and middle-aged (30–45 years) drivers reveal reduced propensity for yellow light running whilst distracted across the entire speed range, exhibiting possible risk compensation during this critical driving situation. The propensity for yellow light running for both distracted male and female older (50–60 years) drivers is significantly higher. Driver experience captured by age interacts with distraction, resulting in their combined effect having slower physiological response and being distracted particularly risky.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Acoustic classification of anurans (frogs) has received increasing attention for its promising application in biological and environment studies. In this study, a novel feature extraction method for frog call classification is presented based on the analysis of spectrograms. The frog calls are first automatically segmented into syllables. Then, spectral peak tracks are extracted to separate desired signal (frog calls) from background noise. The spectral peak tracks are used to extract various syllable features, including: syllable duration, dominant frequency, oscillation rate, frequency modulation, and energy modulation. Finally, a k-nearest neighbor classifier is used for classifying frog calls based on the results of principal component analysis. The experiment results show that syllable features can achieve an average classification accuracy of 90.5% which outperforms Mel-frequency cepstral coefficients features (79.0%).

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Frog protection has become increasingly essential due to the rapid decline of its biodiversity. Therefore, it is valuable to develop new methods for studying this biodiversity. In this paper, a novel feature extraction method is proposed based on perceptual wavelet packet decomposition for classifying frog calls in noisy environments. Pre-processing and syllable segmentation are first applied to the frog call. Then, a spectral peak track is extracted from each syllable if possible. Track duration, dominant frequency and oscillation rate are directly extracted from the track. With k-means clustering algorithm, the calculated dominant frequency of all frog species is clustered into k parts, which produce a frequency scale for wavelet packet decomposition. Based on the adaptive frequency scale, wavelet packet decomposition is applied to the frog calls. Using the wavelet packet decomposition coefficients, a new feature set named perceptual wavelet packet decomposition sub-band cepstral coefficients is extracted. Finally, a k-nearest neighbour (k-NN) classifier is used for the classification. The experiment results show that the proposed features can achieve an average classification accuracy of 97.45% which outperforms syllable features (86.87%) and Mel-frequency cepstral coefficients (MFCCs) feature (90.80%).

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Frogs have received increasing attention due to their effectiveness for indicating the environment change. Therefore, it is important to monitor and assess frogs. With the development of sensor techniques, large volumes of audio data (including frog calls) have been collected and need to be analysed. After transforming the audio data into its spectrogram representation using short-time Fourier transform, the visual inspection of this representation motivates us to use image processing techniques for analysing audio data. Applying acoustic event detection (AED) method to spectrograms, acoustic events are firstly detected from which ridges are extracted. Three feature sets, Mel-frequency cepstral coefficients (MFCCs), AED feature set and ridge feature set, are then used for frog call classification with a support vector machine classifier. Fifteen frog species widely spread in Queensland, Australia, are selected to evaluate the proposed method. The experimental results show that ridge feature set can achieve an average classification accuracy of 74.73% which outperforms the MFCCs (38.99%) and AED feature set (67.78%).

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Over past few decades, frog species have been experiencing dramatic decline around the world. The reason for this decline includes habitat loss, invasive species, climate change and so on. To better know the status of frog species, classifying frogs has become increasingly important. In this study, acoustic features are investigated for multi-level classification of Australian frogs: family, genus and species, including three families, eleven genera and eighty five species which are collected from Queensland, Australia. For each frog species, six instances are selected from which ten acoustic features are calculated. Then, the multicollinearity between ten features are studied for selecting non-correlated features for subsequent analysis. A decision tree (DT) classifier is used to visually and explicitly determine which acoustic features are relatively important for classifying family, which for genus, and which for species. Finally, a weighted support vector machines (SVMs) classifier is used for the multi- level classification with three most important acoustic features respectively. Our experiment results indicate that using different acoustic feature sets can successfully classify frogs at different levels and the average classification accuracy can be up to 85.6%, 86.1% and 56.2% for family, genus and species respectively.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Document clustering is one of the prominent methods for mining important information from the vast amount of data available on the web. However, document clustering generally suffers from the curse of dimensionality. Providentially in high dimensional space, data points tend to be more concentrated in some areas of clusters. We take advantage of this phenomenon by introducing a novel concept of dynamic cluster representation named as loci. Clusters’ loci are efficiently calculated using documents’ ranking scores generated from a search engine. We propose a fast loci-based semi-supervised document clustering algorithm that uses clusters’ loci instead of conventional centroids for assigning documents to clusters. Empirical analysis on real-world datasets shows that the proposed method produces cluster solutions with promising quality and is substantially faster than several benchmarked centroid-based semi-supervised document clustering methods.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Bioacoustic data can be used for monitoring animal species diversity. The deployment of acoustic sensors enables acoustic monitoring at large temporal and spatial scales. We describe a content-based birdcall retrieval algorithm for the exploration of large data bases of acoustic recordings. In the algorithm, an event-based searching scheme and compact features are developed. In detail, ridge events are detected from audio files using event detection on spectral ridges. Then event alignment is used to search through audio files to locate candidate instances. A similarity measure is then applied to dimension-reduced spectral ridge feature vectors. The event-based searching method processes a smaller list of instances for faster retrieval. The experimental results demonstrate that our features achieve better success rate than existing methods and the feature dimension is greatly reduced.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The development of techniques for scaling up classifiers so that they can be applied to problems with large datasets of training examples is one of the objectives of data mining. Recently, AdaBoost has become popular among machine learning community thanks to its promising results across a variety of applications. However, training AdaBoost on large datasets is a major problem, especially when the dimensionality of the data is very high. This paper discusses the effect of high dimensionality on the training process of AdaBoost. Two preprocessing options to reduce dimensionality, namely the principal component analysis and random projection are briefly examined. Random projection subject to a probabilistic length preserving transformation is explored further as a computationally light preprocessing step. The experimental results obtained demonstrate the effectiveness of the proposed training process for handling high dimensional large datasets.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

We investigate methods for data-based selection of working covariance models in the analysis of correlated data with generalized estimating equations. We study two selection criteria: Gaussian pseudolikelihood and a geodesic distance based on discrepancy between model-sensitive and model-robust regression parameter covariance estimators. The Gaussian pseudolikelihood is found in simulation to be reasonably sensitive for several response distributions and noncanonical mean-variance relations for longitudinal data. Application is also made to a clinical dataset. Assessment of adequacy of both correlation and variance models for longitudinal data should be routine in applications, and we describe open-source software supporting this practice.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Selecting an appropriate working correlation structure is pertinent to clustered data analysis using generalized estimating equations (GEE) because an inappropriate choice will lead to inefficient parameter estimation. We investigate the well-known criterion of QIC for selecting a working correlation Structure. and have found that performance of the QIC is deteriorated by a term that is theoretically independent of the correlation structures but has to be estimated with an error. This leads LIS to propose a correlation information criterion (CIC) that substantially improves the QIC performance. Extensive simulation studies indicate that the CIC has remarkable improvement in selecting the correct correlation structures. We also illustrate our findings using a data set from the Madras Longitudinal Schizophrenia Study.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

The method of generalised estimating equations for regression modelling of clustered outcomes allows for specification of a working matrix that is intended to approximate the true correlation matrix of the observations. We investigate the asymptotic relative efficiency of the generalised estimating equation for the mean parameters when the correlation parameters are estimated by various methods. The asymptotic relative efficiency depends on three-features of the analysis, namely (i) the discrepancy between the working correlation structure and the unobservable true correlation structure, (ii) the method by which the correlation parameters are estimated and (iii) the 'design', by which we refer to both the structures of the predictor matrices within clusters and distribution of cluster sizes. Analytical and numerical studies of realistic data-analysis scenarios show that choice of working covariance model has a substantial impact on regression estimator efficiency. Protection against avoidable loss of efficiency associated with covariance misspecification is obtained when a 'Gaussian estimation' pseudolikelihood procedure is used with an AR(1) structure.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

In this paper we tackle the problem of efficient video event detection. We argue that linear detection functions should be preferred in this regard due to their scalability and efficiency during estimation and evaluation. A popular approach in this regard is to represent a sequence using a bag of words (BOW) representation due to its: (i) fixed dimensionality irrespective of the sequence length, and (ii) its ability to compactly model the statistics in the sequence. A drawback to the BOW representation, however, is the intrinsic destruction of the temporal ordering information. In this paper we propose a new representation that leverages the uncertainty in relative temporal alignments between pairs of sequences while not destroying temporal ordering. Our representation, like BOW, is of a fixed dimensionality making it easily integrated with a linear detection function. Extensive experiments on CK+, 6DMG, and UvA-NEMO databases show significant performance improvements across both isolated and continuous event detection tasks.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Acoustic recordings play an increasingly important role in monitoring terrestrial and aquatic environments. However, rapid advances in technology make it possible to accumulate thousands of hours of recordings, more than ecologists can ever listen to. Our approach to this big-data challenge is to visualize the content of long-duration audio recordings on multiple scales, from minutes, hours, days to years. The visualization should facilitate navigation and yield ecologically meaningful information prior to listening to the audio. To construct images, we calculate acoustic indices, statistics that describe the distribution of acoustic energy and reflect content of ecological interest. We combine various indices to produce false-color spectrogram images that reveal acoustic content and facilitate navigation. The technical challenge we investigate in this work is how to navigate recordings that are days or even months in duration. We introduce a method of zooming through multiple temporal scales, analogous to Google Maps. However, the “landscape” to be navigated is not geographical and not therefore intrinsically visual, but rather a graphical representation of the underlying audio. We describe solutions to navigating spectrograms that range over three orders of magnitude of temporal scale. We make three sets of observations: 1. We determine that at least ten intermediate scale steps are required to zoom over three orders of magnitude of temporal scale; 2. We determine that three different visual representations are required to cover the range of temporal scales; 3. We present a solution to the problem of maintaining visual continuity when stepping between different visual representations. Finally, we demonstrate the utility of the approach with four case studies.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

Helicobacter pylori infection is a risk factor for gastric cancer, which is a major health issue worldwide. Gastric cancer has a poor prognosis due to the unnoticeable progression of the disease and surgery is the only available treatment in gastric cancer. Therefore, gastric cancer patients would greatly benefit from identifying biomarker genes that would improve diagnostic and prognostic prediction and provide targets for molecular therapies. DNA copy number amplifications are the hallmarks of cancers in various anatomical locations. Mechanisms of amplification predict that DNA double-strand breaks occur at the margins of the amplified region. The first objective of this thesis was to identify the genes that were differentially expressed in H. pylori infection as well as the transcription factors and signal transduction pathways that were associated with the gene expression changes. The second objective was to identify putative biomarker genes in gastric cancer with correlated expression and copy number, and the last objective was to characterize cancers based on DNA copy number amplifications. DNA microarrays, an in vitro model and real-time polymerase chain reaction were used to measure gene expression changes in H. pylori infected AGS cells. In order to identify the transcription factors and signal transduction pathways that were activated after H. pylori infection, gene expression profiling data from the H. pylori experiments and a bioinformatics approach accompanied by experimental validation were used. Genome-wide expression and copy number microarray analysis of clinical gastric cancer samples and immunohistochemistry on tissue microarray were used to identify putative gastric cancer genes. Data mining and machine learning techniques were applied to study amplifications in a cross-section of cancers. FOS and various stress response genes were regulated by H. pylori infection. H. pylori regulated genes were enriched in the chromosomal regions that are frequently changed in gastric cancer, suggesting that molecular pathways of gastric cancer and premalignant H. pylori infection that induces gastritis are interconnected. 16 transcription factors were identified as being associated with H. pylori infection induced changes in gene expression. NF-κB transcription factor and p50 and p65 subunits were verified using elecrophoretic mobility shift assays. ERBB2 and other genes located in 17q12- q21 were found to be up-regulated in association with copy number amplification in gastric cancer. Cancers with similar cell type and origin clustered together based on the genomic localization of the amplifications. Cancer genes and large genes were co-localized with amplified regions and fragile sites, telomeres, centromeres and light chromosome bands were enriched at the amplification boundaries. H. pylori activated transcription factors and signal transduction pathways function in cellular mechanisms that might be capable of promoting carcinogenesis of the stomach. Intestinal and diffuse type gastric cancers showed distinct molecular genetic profiles. Integration of gene expression and copy number microarray data allowed the identification of genes that might be involved in gastric carcinogenesis and have clinical relevance. Gene amplifications were demonstrated to be non-random genomic instabilities. Cell lineage, properties of precursor stem cells, tissue microenvironment and genomic map localization of specific oncogenes define the site specificity of DNA amplifications, whereas labile genomic features define the structures of amplicons. These conclusions suggest that the definition of genomic changes in cancer is based on the interplay between the cancer cell and the tumor microenvironment.

Relevância:

80.00% 80.00%

Publicador:

Resumo:

This research is a step forward in discovering knowledge from databases of complex structure like tree or graph. Several data mining algorithms are developed based on a novel representation called Balanced Optimal Search for extracting implicit, unknown and potentially useful information like patterns, similarities and various relationships from tree data, which are also proved to be advantageous in analysing big data. This thesis focuses on analysing unordered tree data, which is robust to data inconsistency, irregularity and swift information changes, hence, in the era of big data it becomes a popular and widely used data model.