996 resultados para data preprocessing


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Traffic flow time series data are usually high dimensional and very complex. Also they are sometimes imprecise and distorted due to data collection sensor malfunction. Additionally, events like congestion caused by traffic accidents add more uncertainty to real-time traffic conditions, making traffic flow forecasting a complicated task. This article presents a new data preprocessing method targeting multidimensional time series with a very high number of dimensions and shows its application to real traffic flow time series from the California Department of Transportation (PEMS web site). The proposed method consists of three main steps. First, based on a language for defining events in multidimensional time series, mTESL, we identify a number of types of events in time series that corresponding to either incorrect data or data with interference. Second, each event type is restored utilizing an original method that combines real observations, local forecasted values and historical data. Third, an exponential smoothing procedure is applied globally to eliminate noise interference and other random errors so as to provide good quality source data for future work.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

In the present study, using noise-free simulated signals, we performed a comparative examination of several preprocessing techniques that are used to transform the cardiac event series in a regularly sampled time series, appropriate for spectral analysis of heart rhythm variability (HRV). First, a group of noise-free simulated point event series, which represents a time series of heartbeats, was generated by an integral pulse frequency modulation model. In order to evaluate the performance of the preprocessing methods, the differences between the spectra of the preprocessed simulated signals and the true spectrum (spectrum of the model input modulating signals) were surveyed by visual analysis and by contrasting merit indices. It is desired that estimated spectra match the true spectrum as close as possible, showing a minimum of harmonic components and other artifacts. The merit indices proposed to quantify these mismatches were the leakage rate, defined as a measure of leakage components (located outside some narrow windows centered at frequencies of model input modulating signals) with respect to the whole spectral components, and the numbers of leakage components with amplitudes greater than 1%, 5% and 10% of the total spectral components. Our data, obtained from a noise-free simulation, indicate that the utilization of heart rate values instead of heart period values in the derivation of signals representative of heart rhythm results in more accurate spectra. Furthermore, our data support the efficiency of the widely used preprocessing technique based on the convolution of inverse interval function values with a rectangular window, and suggest the preprocessing technique based on a cubic polynomial interpolation of inverse interval function values and succeeding spectral analysis as another efficient and fast method for the analysis of HRV signals

Relevância:

70.00% 70.00%

Publicador:

Resumo:

We have investigated the use of hierarchical clustering of flow cytometry data to classify samples of conventional central chondrosarcoma, a malignant cartilage forming tumor of uncertain cellular origin, according to similarities with surface marker profiles of several known cell types. Human primary chondrosarcoma cells, articular chondrocytes, mesenchymal stem cells, fibroblasts, and a panel of tumor cell lines from chondrocytic or epithelial origin were clustered based on the expression profile of eleven surface markers. For clustering, eight hierarchical clustering algorithms, three distance metrics, as well as several approaches for data preprocessing, including multivariate outlier detection, logarithmic transformation, and z-score normalization, were systematically evaluated. By selecting clustering approaches shown to give reproducible results for cluster recovery of known cell types, primary conventional central chondrosacoma cells could be grouped in two main clusters with distinctive marker expression signatures: one group clustering together with mesenchymal stem cells (CD49b-high/CD10-low/CD221-high) and a second group clustering close to fibroblasts (CD49b-low/CD10-high/CD221-low). Hierarchical clustering also revealed substantial differences between primary conventional central chondrosarcoma cells and established chondrosarcoma cell lines, with the latter not only segregating apart from primary tumor cells and normal tissue cells, but clustering together with cell lines from epithelial lineage. Our study provides a foundation for the use of hierarchical clustering applied to flow cytometry data as a powerful tool to classify samples according to marker expression patterns, which could lead to uncover new cancer subtypes.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The accurate and reliable estimation of travel time based on point detector data is needed to support Intelligent Transportation System (ITS) applications. It has been found that the quality of travel time estimation is a function of the method used in the estimation and varies for different traffic conditions. In this study, two hybrid on-line travel time estimation models, and their corresponding off-line methods, were developed to achieve better estimation performance under various traffic conditions, including recurrent congestion and incidents. The first model combines the Mid-Point method, which is a speed-based method, with a traffic flow-based method. The second model integrates two speed-based methods: the Mid-Point method and the Minimum Speed method. In both models, the switch between travel time estimation methods is based on the congestion level and queue status automatically identified by clustering analysis. During incident conditions with rapidly changing queue lengths, shock wave analysis-based refinements are applied for on-line estimation to capture the fast queue propagation and recovery. Travel time estimates obtained from existing speed-based methods, traffic flow-based methods, and the models developed were tested using both simulation and real-world data. The results indicate that all tested methods performed at an acceptable level during periods of low congestion. However, their performances vary with an increase in congestion. Comparisons with other estimation methods also show that the developed hybrid models perform well in all cases. Further comparisons between the on-line and off-line travel time estimation methods reveal that off-line methods perform significantly better only during fast-changing congested conditions, such as during incidents. The impacts of major influential factors on the performance of travel time estimation, including data preprocessing procedures, detector errors, detector spacing, frequency of travel time updates to traveler information devices, travel time link length, and posted travel time range, were investigated in this study. The results show that these factors have more significant impacts on the estimation accuracy and reliability under congested conditions than during uncongested conditions. For the incident conditions, the estimation quality improves with the use of a short rolling period for data smoothing, more accurate detector data, and frequent travel time updates.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A miniaturised gas analyser is described and evaluated based on the use of a substrate-integrated hollow waveguide (iHWG) coupled to a microsized near-infrared spectrophotometer comprising a linear variable filter and an array of InGaAs detectors. This gas sensing system was applied to analyse surrogate samples of natural fuel gas containing methane, ethane, propane and butane, quantified by using multivariate regression models based on partial least square (PLS) algorithms and Savitzky-Golay 1(st) derivative data preprocessing. The external validation of the obtained models reveals root mean square errors of prediction of 0.37, 0.36, 0.67 and 0.37% (v/v), for methane, ethane, propane and butane, respectively. The developed sensing system provides particularly rapid response times upon composition changes of the gaseous sample (approximately 2 s) due the minute volume of the iHWG-based measurement cell. The sensing system developed in this study is fully portable with a hand-held sized analyser footprint, and thus ideally suited for field analysis. Last but not least, the obtained results corroborate the potential of NIR-iHWG analysers for monitoring the quality of natural gas and petrochemical gaseous products.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The Lattes platform is the major scientific information system maintained by the National Council for Scientific and Technological Development (CNPq). This platform allows to manage the curricular information of researchers and institutions working in Brazil based on the so called Lattes Curriculum. However, the public information is individually available for each researcher, not providing the automatic creation of reports of several scientific productions for research groups. It is thus difficult to extract and to summarize useful knowledge for medium to large size groups of researchers. This paper describes the design, implementation and experiences with scriptLattes: an open-source system to create academic reports of groups based on curricula of the Lattes Database. The scriptLattes system is composed by the following modules: (a) data selection, (b) data preprocessing, (c) redundancy treatment, (d) collaboration graph generation among group members, (e) research map generation based on geographical information, and (f) automatic report creation of bibliographical, technical and artistic production, and academic supervisions. The system has been extensively tested for a large variety of research groups of Brazilian institutions, and the generated reports have shown an alternative to easily extract knowledge from data in the context of Lattes platform. The source code, usage instructions and examples are available at http://scriptlattes.sourceforge.net/.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Feature selection is one of important and frequently used techniques in data preprocessing. It can improve the efficiency and the effectiveness of data mining by reducing the dimensions of feature space and removing the irrelevant and redundant information. Feature selection can be viewed as a global optimization problem of finding a minimum set of M relevant features that describes the dataset as well as the original N attributes. In this paper, we apply the adaptive partitioned random search strategy into our feature selection algorithm. Under this search strategy, the partition structure and evaluation function is proposed for feature selection problem. This algorithm ensures the global optimal solution in theory and avoids complete randomness in search direction. The good property of our algorithm is shown through the theoretical analysis.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Problem of modeling of anaesthesia depth level is studied in this Master Thesis. It applies analysis of EEG signals with nonlinear dynamics theory and further classification of obtained values. The main stages of this study are the following: data preprocessing; calculation of optimal embedding parameters for phase space reconstruction; obtaining reconstructed phase portraits of each EEG signal; formation of the feature set to characterise obtained phase portraits; classification of four different anaesthesia levels basing on previously estimated features. Classification was performed with: Linear and quadratic Discriminant Analysis, k Nearest Neighbours method and online clustering. In addition, this work provides overview of existing approaches to anaesthesia depth monitoring, description of basic concepts of nonlinear dynamics theory used in this Master Thesis and comparative analysis of several different classification methods.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Learning disability (LD) is a neurological condition that affects a child’s brain and impairs his ability to carry out one or many specific tasks. LD affects about 10% of children enrolled in schools. There is no cure for learning disabilities and they are lifelong. The problems of children with specific learning disabilities have been a cause of concern to parents and teachers for some time. Just as there are many different types of LDs, there are a variety of tests that may be done to pinpoint the problem The information gained from an evaluation is crucial for finding out how the parents and the school authorities can provide the best possible learning environment for child. This paper proposes a new approach in artificial neural network (ANN) for identifying LD in children at early stages so as to solve the problems faced by them and to get the benefits to the students, their parents and school authorities. In this study, we propose a closest fit algorithm data preprocessing with ANN classification to handle missing attribute values. This algorithm imputes the missing values in the preprocessing stage. Ignoring of missing attribute values is a common trend in all classifying algorithms. But, in this paper, we use an algorithm in a systematic approach for classification, which gives a satisfactory result in the prediction of LD. It acts as a tool for predicting the LD accurately, and good information of the child is made available to the concerned

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Learning Disability (LD) is a neurological condition that affects a child’s brain and impairs his ability to carry out one or many specific tasks. LD affects about 15 % of children enrolled in schools. The prediction of LD is a vital and intricate job. The aim of this paper is to design an effective and powerful tool, using the two intelligent methods viz., Artificial Neural Network and Adaptive Neuro-Fuzzy Inference System, for measuring the percentage of LD that affected in school-age children. In this study, we are proposing some soft computing methods in data preprocessing for improving the accuracy of the tool as well as the classifier. The data preprocessing is performed through Principal Component Analysis for attribute reduction and closest fit algorithm is used for imputing missing values. The main idea in developing the LD prediction tool is not only to predict the LD present in children but also to measure its percentage along with its class like low or minor or major. The system is implemented in Mathworks Software MatLab 7.10. The results obtained from this study have illustrated that the designed prediction system or tool is capable of measuring the LD effectively

Relevância:

60.00% 60.00%

Publicador:

Resumo:

A new database of weather and circulation type catalogs is presented comprising 17 automated classification methods and five subjective classifications. It was compiled within COST Action 733 "Harmonisation and Applications of Weather Type Classifications for European regions" in order to evaluate different methods for weather and circulation type classification. This paper gives a technical description of the included methods using a new conceptual categorization for classification methods reflecting the strategy for the definition of types. Methods using predefined types include manual and threshold based classifications while methods producing types derived from the input data include those based on eigenvector techniques, leader algorithms and optimization algorithms. In order to allow direct comparisons between the methods, the circulation input data and the methods' configuration were harmonized for producing a subset of standard catalogs of the automated methods. The harmonization includes the data source, the climatic parameters used, the classification period as well as the spatial domain and the number of types. Frequency based characteristics of the resulting catalogs are presented, including variation of class sizes, persistence, seasonal and inter-annual variability as well as trends of the annual frequency time series. The methodological concept of the classifications is partly reflected by these properties of the resulting catalogs. It is shown that the types of subjective classifications compared to automated methods show higher persistence, inter-annual variation and long-term trends. Among the automated classifications optimization methods show a tendency for longer persistence and higher seasonal variation. However, it is also concluded that the distance metric used and the data preprocessing play at least an equally important role for the properties of the resulting classification compared to the algorithm used for type definition and assignment.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevância:

60.00% 60.00%

Publicador:

Resumo:

OBJECTIVES Molecular subclassification of non small-cell lung cancer (NSCLC) is essential to improve clinical outcome. This study assessed the prognostic and predictive value of circulating micro-RNA (miRNA) in patients with non-squamous NSCLC enrolled in the phase II SAKK (Swiss Group for Clinical Cancer Research) trial 19/05, receiving uniform treatment with first-line bevacizumab and erlotinib followed by platinum-based chemotherapy at progression. MATERIALS AND METHODS Fifty patients with baseline and 24 h blood samples were included from SAKK 19/05. The primary study endpoint was to identify prognostic (overall survival, OS) miRNA's. Patient samples were analyzed with Agilent human miRNA 8x60K microarrays, each glass slide formatted with eight high-definition 60K arrays. Each array contained 40 probes targeting each of the 1347 miRNA. Data preprocessing included quantile normalization using robust multi-array average (RMA) algorithm. Prognostic and predictive miRNA expression profiles were identified by Spearman's rank correlation test (percentage tumor shrinkage) or log-rank testing (for time-to-event endpoints). RESULTS Data preprocessing kept 49 patients and 424 miRNA for further analysis. Ten miRNA's were significantly associated with OS, with hsa-miR-29a being the strongest prognostic marker (HR=6.44, 95%-CI 2.39-17.33). Patients with high has-miR-29a expression had a significantly lower survival at 10 months compared to patients with a low expression (54% versus 83%). Six out of the 10 miRNA's (hsa-miRN-29a, hsa-miR-542-5p, hsa-miR-502-3p, hsa-miR-376a, hsa-miR-500a, hsa-miR-424) were insensitive to perturbations according to jackknife cross-validation on their HR for OS. The respective principal component analysis (PCA) defined a meta-miRNA signature including the same 6 miRNA's, resulting in a HR of 0.66 (95%-CI 0.53-0.82). CONCLUSION Cell-free circulating miRNA-profiling successfully identified a highly prognostic 6-gene signature in patients with advanced non-squamous NSCLC. Circulating miRNA profiling should further be validated in external cohorts for the selection and monitoring of systemic treatment in patients with advanced NSCLC.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Random Forests™ is reported to be one of the most accurate classification algorithms in complex data analysis. It shows excellent performance even when most predictors are noisy and the number of variables is much larger than the number of observations. In this thesis Random Forests was applied to a large-scale lung cancer case-control study. A novel way of automatically selecting prognostic factors was proposed. Also, synthetic positive control was used to validate Random Forests method. Throughout this study we showed that Random Forests can deal with large number of weak input variables without overfitting. It can account for non-additive interactions between these input variables. Random Forests can also be used for variable selection without being adversely affected by collinearities. ^ Random Forests can deal with the large-scale data sets without rigorous data preprocessing. It has robust variable importance ranking measure. Proposed is a novel variable selection method in context of Random Forests that uses the data noise level as the cut-off value to determine the subset of the important predictors. This new approach enhanced the ability of the Random Forests algorithm to automatically identify important predictors for complex data. The cut-off value can also be adjusted based on the results of the synthetic positive control experiments. ^ When the data set had high variables to observations ratio, Random Forests complemented the established logistic regression. This study suggested that Random Forests is recommended for such high dimensionality data. One can use Random Forests to select the important variables and then use logistic regression or Random Forests itself to estimate the effect size of the predictors and to classify new observations. ^ We also found that the mean decrease of accuracy is a more reliable variable ranking measurement than mean decrease of Gini. ^