908 results for Pre-processing


Relevance:

60.00%

Publisher:

Abstract:

Road networks are critical national infrastructure. Road assets need to be monitored and maintained efficiently as their condition deteriorates over time. The condition of one such asset, road pavement, plays a major role in road network maintenance programmes. Pavement condition depends on many factors, such as pavement type, traffic and environmental conditions. This paper presents a data analytics case study for assessing the factors affecting the pavement deflection values measured by the traffic speed deflectometer (TSD) device. The analytics process includes acquisition and integration of data from multiple sources, data pre-processing, mining useful information from the data and utilising the data mining outputs for knowledge deployment. Data mining techniques are able to show how TSD outputs vary across different road, traffic and environmental conditions. The generated data mining models map the TSD outputs to a set of classes and define correction factors for each class.

Relevance:

60.00%

Publisher:

Abstract:

MapReduce frameworks such as Hadoop are well suited to handling large data sets that can be processed separately and independently, with canonical applications in information retrieval and sales record analysis. Rapid advances in sequencing technology have led to an explosion in the availability of genomic data, with a consequent rise in the importance of large-scale comparative genomics, which often involves operations and data relationships that deviate from the classical MapReduce structure. This work examines the application of Hadoop to patterns of this nature, focusing on a well-established workflow for identifying promoters (binding sites for regulatory proteins) across multiple gene regions and organisms, coupled with the unifying step of assembling these results into a consensus sequence. Our approach demonstrates the utility of Hadoop for problems of this nature, showing how the tyranny of the "dominant decomposition" can be at least partially overcome. It also demonstrates how load balance and the granularity of parallelism can be optimized by pre-processing that splits and reorganizes input files, allowing a wide range of related problems to be brought under the same computational umbrella.
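The abstract does not say how the input files are split and reorganized. As one possible illustration only, the sketch below balances hypothetical sequence records across a fixed number of input splits with a greedy "largest first" heuristic, so that each map task receives a similar amount of data. The record names, sizes and the `balanced_splits` helper are assumptions made for this example, not part of the described workflow.

```python
import heapq

def balanced_splits(record_sizes, n_splits):
    """Greedy 'largest first' partition: each record goes to the lightest split so far.

    record_sizes maps a record name to its size (for example, bases or bytes).
    Returns a list of n_splits lists of record names.
    """
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    # Min-heap of (current load, split index); the lightest split is popped first.
    heap = [(0, i) for i in range(n_splits)]
    heapq.heapify(heap)
    splits = [[] for _ in range(n_splits)]
    # Largest records first; ties are broken by name to keep the result deterministic.
    for name, size in sorted(record_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        splits[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return splits

# Hypothetical gene-region records and their sizes.
sizes = {"geneA": 900, "geneB": 500, "geneC": 400, "geneD": 100}
splits = balanced_splits(sizes, 2)
loads = [sum(sizes[name] for name in split) for split in splits]
```

Here the two splits end up with loads of 1000 and 900, whereas splitting the records in file order (geneA and geneB, then geneC and geneD) would give 1400 and 500.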

Relevance:

60.00%

Publisher:

Abstract:

Determining the similarity between business process models has recently gained interest in the business process management community. So far, similarity has been addressed separately at either the semantic or the structural level of process models. Moreover, most contributions that measure the similarity of process models assume an ideal case in which process models are enriched with semantics, that is, a description of the meaning of process model elements. In real life, however, this requires a pre-processing phase that consumes heavy human effort and is often not feasible. In this paper we propose an automated approach for querying a business process model repository for structurally and semantically relevant models. Similar to a search on the Internet, a user formulates a BPMN-Q query and receives a list of process models ordered by relevance to the query. We provide a business process model search engine implementation for evaluating the proposed approach.

Relevance:

60.00%

Publisher:

Abstract:

This study aims to assess the accuracy of a Digital Elevation Model (DEM) generated using Toutin’s model. Toutin’s model was run using OrthoEngineSE of PCI Geomatics 10.3. The along-track stereo images of the Advanced Spaceborne Thermal Emission and Reflection Radiometer (ASTER) sensor, with 15 m resolution, were used to produce a DEM for an area of low elevation near Mean Sea Level (MSL) in Johor, Malaysia. Despite the satisfactory pre-processing results, visual assessment of the DEM generated from Toutin’s model showed that it contained many outliers and incorrect values. The failure of Toutin’s model may mostly be due to the inaccuracy and insufficiency of ASTER ephemeris data for low terrain, as well as the huge water body in the stereo images.

Relevance:

60.00%

Publisher:

Abstract:

Descriptions of patients' injuries are recorded in narrative text form by hospital emergency departments. For statistical reporting, this text data needs to be mapped to pre-defined codes. Existing research in this field uses the Naïve Bayes probabilistic method to build classifiers for the mapping. In this paper, we focus on providing guidance on the selection of a classification method. We build a number of classifiers belonging to different classification families, such as decision tree, probabilistic, neural network, instance-based, ensemble-based and kernel-based linear classifiers. Extensive pre-processing is carried out to ensure the quality of the data and, hence, the quality of the classification outcome. Records with a null entry in the injury description are removed. Misspellings are corrected by finding each misspelt word and replacing it with a similar-sounding word. Meaningful phrases have been identified and kept, rather than having part of a phrase removed as a stop word. Abbreviations appearing in many forms of entry are manually identified and only one form of each abbreviation is used. Clustering is utilised to discriminate between non-frequent and frequent terms. This process reduced the number of text features dramatically, from about 28,000 to 5,000. The medical narrative text injury dataset under consideration is composed of many short documents. The data can be characterized as high-dimensional and sparse, i.e., few features are irrelevant but features are correlated with one another. Therefore, matrix factorization techniques such as Singular Value Decomposition (SVD) and Non Negative Matrix Factorization (NNMF) have been used to map the processed feature space to a lower-dimensional feature space. Classifiers have been built on these reduced feature spaces. In the experiments, a set of tests is conducted to determine which classification method is best for medical text classification.
Non Negative Matrix Factorization combined with a Support Vector Machine achieves 93% precision, which is higher than all the tested traditional classifiers. We also found that TF/IDF weighting, which works well for long text classification, is inferior to binary weighting in short document classification. Another finding is that the Top-n terms should be removed in consultation with medical experts, as this affects the classification performance.
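To make the contrast between the two term weightings concrete, the following minimal sketch builds a binary and a TF/IDF document-term matrix for two invented short narratives. The tokenised documents and the plain `tf * log(N / df)` formula are assumptions for illustration; the abstract does not state which TF/IDF variant was used.

```python
import math
from collections import Counter

def vocabulary(docs):
    """Sorted list of all distinct terms in the tokenised documents."""
    return sorted({term for doc in docs for term in doc})

def binary_weights(docs):
    """1 if the term occurs in the document, 0 otherwise."""
    vocab = vocabulary(docs)
    return vocab, [[1 if term in set(doc) else 0 for term in vocab] for doc in docs]

def tfidf_weights(docs):
    """Raw term frequency times log(N / document frequency); one common variant."""
    vocab = vocabulary(docs)
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    rows = []
    for doc in docs:
        term_freq = Counter(doc)
        rows.append([term_freq[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows

# Two invented, already tokenised injury narratives.
docs = [["fell", "off", "ladder"], ["fell", "down", "stairs"]]
vocab, binary = binary_weights(docs)
_, tfidf = tfidf_weights(docs)
```

In this toy case the term "fell" occurs in both documents, so its TF/IDF weight is zero, while the binary weighting still records its presence in each narrative.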

Relevance:

60.00%

Publisher:

Abstract:

Narrative text is a useful way of identifying injury circumstances from routine emergency department data collections. Automatically classifying narratives using machine learning is a promising technique that can reduce the tedious manual classification process. Existing work focuses on Naive Bayes, which does not always offer the best performance. This paper proposes Matrix Factorization approaches along with a learning enhancement process for this task. The results are compared with the performance of various other classification approaches. The impact of parameter settings on the classification results for a medical text dataset is discussed. With the right choice of dimension k, the Non Negative Matrix Factorization model achieves a 10-fold cross-validation (CV) accuracy of 0.93.

Relevance:

60.00%

Publisher:

Abstract:

Increasingly large-scale applications are generating an unprecedented amount of data. However, the growing gap between computation and I/O capacity on High End Computing (HEC) machines creates a severe bottleneck for data analysis. Instead of moving data from its source to the output storage, in-situ analytics processes output data while simulations are running. However, in-situ data analysis causes much more contention for computing resources with the simulations, and such contention severely damages simulation performance on HEC machines. Since different data processing strategies have different impacts on performance and cost, there is a consequent need for flexibility in the location of data analytics. In this paper, we explore and analyze several potential data analytics placement strategies along the I/O path. To find the best strategy for reducing data movement in a given situation, we propose a flexible data analytics (FlexAnalytics) framework. Based on this framework, a FlexAnalytics prototype system is developed for analytics placement. The FlexAnalytics system enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and visualization, as well as for large-scale data transfer. Two use cases (scientific data compression and remote visualization) have been applied in the study to verify the performance of FlexAnalytics. Experimental results demonstrate that the FlexAnalytics framework increases data transition bandwidth and improves the application end-to-end transfer performance.

Relevance:

60.00%

Publisher:

Abstract:

Frog protection has become increasingly essential due to the rapid decline of frog biodiversity. Therefore, it is valuable to develop new methods for studying this biodiversity. In this paper, a novel feature extraction method based on perceptual wavelet packet decomposition is proposed for classifying frog calls in noisy environments. Pre-processing and syllable segmentation are first applied to the frog call. Then, a spectral peak track is extracted from each syllable where possible. Track duration, dominant frequency and oscillation rate are extracted directly from the track. Using the k-means clustering algorithm, the calculated dominant frequencies of all frog species are clustered into k parts, which produces a frequency scale for wavelet packet decomposition. Based on this adaptive frequency scale, wavelet packet decomposition is applied to the frog calls. Using the wavelet packet decomposition coefficients, a new feature set named perceptual wavelet packet decomposition sub-band cepstral coefficients is extracted. Finally, a k-nearest neighbour (k-NN) classifier is used for the classification. The experimental results show that the proposed features achieve an average classification accuracy of 97.45%, which outperforms syllable features (86.87%) and Mel-frequency cepstral coefficient (MFCC) features (90.80%).
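The clustering step that turns dominant frequencies into a frequency scale can be sketched roughly as follows. This is a plain one-dimensional k-means (Lloyd's algorithm) with a deterministic start. The example frequencies are invented, and taking band edges as midpoints between neighbouring cluster centres is an assumption of this sketch rather than a detail given in the abstract.

```python
def kmeans_1d(values, k, iterations=50):
    """Lloyd's k-means on scalar values; returns the sorted cluster centres."""
    data = sorted(values)
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of values")
    # Deterministic start: k evenly spaced points taken from the sorted data.
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in data:
            nearest = min(range(k), key=lambda i: abs(v - centres[i]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centre.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if updated == centres:
            break
        centres = updated
    return sorted(centres)

def band_edges(centres):
    """Assumed rule: a band boundary lies midway between neighbouring centres."""
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]

# Invented dominant frequencies (Hz) from syllables of several species.
dominant_freqs = [500, 520, 510, 2000, 2100, 2050, 4000, 4100]
centres = kmeans_1d(dominant_freqs, k=3)
edges = band_edges(centres)
```

With these values the centres settle at 510, 2050 and 4050 Hz, giving band boundaries at 1280 and 3050 Hz.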

Relevance:

60.00%

Publisher:

Abstract:

Near infrared spectroscopy (NIRS) combined with multivariate analysis techniques was applied to assess the phenol content of European oak. NIRS data were first collected directly from solid heartwood surfaces: the spectra were recorded separately from the longitudinal radial and the transverse section surfaces by diffuse reflectance. The spectral data were then pretreated by several pre-processing procedures, such as multiplicative scatter correction, first derivative, second derivative and standard normal variate. The tannin contents of sawmill collected from the longitudinal radial and transverse section surfaces were determined by quantitative extraction with water/methanol (1:4, by vol). Then, total phenol contents in the tannin extracts were measured by the Folin-Ciocalteu method. The NIR data were correlated against the Folin-Ciocalteu results. Calibration models built with partial least squares regression displayed a strong correlation between measured and predicted total phenol content, as expressed by a high coefficient of determination (r²) and a high ratio of performance to deviation (RPD), together with low calibration and prediction errors (RMSEC, RMSEP). The best calibration was obtained with second derivative spectra (r² of 0.93 for the longitudinal radial plane and 0.91 for the transverse section plane). This study illustrates that the NIRS technique, when used in conjunction with multivariate analysis, could provide a reliable, quick and non-destructive assessment of European oak heartwood extractives.
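Two of the spectral pre-treatments named above are simple enough to sketch directly. The code below shows the standard normal variate (each spectrum centred and scaled by its own mean and standard deviation) and a first derivative taken as plain finite differences. The abstract does not say how the derivatives were computed, so the finite-difference form is an assumption of this sketch, and the spectrum values are invented.

```python
from statistics import mean, stdev

def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by its own mean and SD."""
    m = mean(spectrum)
    s = stdev(spectrum)  # sample standard deviation (n - 1 in the denominator)
    return [(x - m) / s for x in spectrum]

def first_derivative(spectrum):
    """First derivative as differences between neighbouring points (assumed form)."""
    return [b - a for a, b in zip(spectrum, spectrum[1:])]

# Invented absorbance values at consecutive wavelengths.
raw = [1.0, 2.0, 3.0]
scaled = snv(raw)
slope = first_derivative([1.0, 4.0, 9.0])
```

Because SNV works on each spectrum separately, a spectrum that differs from another only by a constant offset and a positive scale factor maps to the same corrected values; for instance `snv([2.0, 4.0, 6.0])` equals `snv([1.0, 2.0, 3.0])`.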

Relevance:

60.00%

Publisher:

Abstract:

A commercial non-specific gas sensor array system was evaluated in terms of its capability to monitor the odour abatement performance of a biofiltration system developed for treating emissions from a commercial piggery building. The biofiltration system was a modular system comprising an inlet ducting system, a humidifier and a closed-bed biofilter. It also included a gravimetric moisture monitoring and water application system for precise control of the moisture content of an organic woodchip medium. Principal component analysis (PCA) of the sensor array measurements indicated that the biofilter outlet air was significantly different from both the inlet air of the system and the post-humidifier air. Data pre-processing techniques, including normalising and outlier handling, were applied to improve the odour discrimination performance of the sensor array. To develop an odour quantification model from the sensor array responses, PCA regression, artificial neural network (ANN) and partial least squares (PLS) modelling techniques were applied. The correlation coefficient (r²) values of the PCA, ANN and PLS models were 0.44, 0.62 and 0.79, respectively.

Relevance:

60.00%

Publisher:

Abstract:

This thesis examines the feasibility of a forest inventory method based on two-phase sampling in estimating forest attributes at the stand or substand levels for forest management purposes. The method is based on multi-source forest inventory combining auxiliary data consisting of remote sensing imagery or other geographic information and field measurements. Auxiliary data are utilized as first-phase data for covering all inventory units. Various methods were examined for improving the accuracy of the forest estimates. Pre-processing of auxiliary data in the form of correcting the spectral properties of aerial imagery was examined (I), as was the selection of aerial image features for estimating forest attributes (II). Various spatial units were compared for extracting image features in a remote sensing aided forest inventory utilizing very high resolution imagery (III). A number of data sources were combined and different weighting procedures were tested in estimating forest attributes (IV, V). Correction of the spectral properties of aerial images proved to be a straightforward and advantageous method for improving the correlation between the image features and the measured forest attributes. Testing different image features that can be extracted from aerial photographs (and other very high resolution images) showed that the images contain a wealth of relevant information that can be extracted only by utilizing the spatial organization of the image pixel values. Furthermore, careful selection of image features for the inventory task generally gives better results than inputting all extractable features to the estimation procedure. When the spatial units for extracting very high resolution image features were examined, an approach based on image segmentation generally showed advantages compared with a traditional sample plot-based approach. Combining several data sources resulted in more accurate estimates than any of the individual data sources alone. 
The best combined estimate can be derived by weighting the estimates produced by the individual data sources by the inverse values of their mean square errors. Despite the fact that the plot-level estimation accuracy in two-phase sampling inventory can be improved in many ways, the accuracy of forest estimates based mainly on single-view satellite and aerial imagery is a relatively poor basis for making stand-level management decisions.
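The inverse mean-square-error weighting mentioned above can be written out in a few lines. In this sketch each source's estimate gets a weight of 1/MSE and the weights are normalised to sum to one. The function name and the stand volume figures are invented for illustration.

```python
def combine_estimates(estimates, mses):
    """Weighted mean of the estimates, with weights proportional to 1 / MSE."""
    if not estimates or len(estimates) != len(mses):
        raise ValueError("need exactly one MSE per estimate")
    weights = [1.0 / m for m in mses]  # assumes every MSE is strictly positive
    total = sum(weights)
    return sum(w * e for w, e in zip(weights, estimates)) / total

# Invented stand volume estimates (m3/ha) from three data sources and their MSEs.
combined = combine_estimates([210.0, 190.0, 200.0], [400.0, 100.0, 200.0])
```

The source with the smallest mean square error (the second one here) pulls the combined value towards its own estimate, giving roughly 195.7 m3/ha. When all sources have the same error the result reduces to the ordinary mean.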

Relevance:

60.00%

Publisher:

Abstract:

A unate function can easily be identified on a Karnaugh map from the well-known property that it consists only of essential prime implicants which intersect at a common implicant. The additional property that the plot of a unate function F(x1, ..., xn) on a Karnaugh map should possess in order that F may also be 1-realizable (n ≤ 6) has been found. It has been shown that the 1-realizability of a unate function F corresponds to the 'compactness' of the plot of F. No resort to inequalities is made, and no pre-processing such as positivizing and ordering of the given function is required.

Relevance:

60.00%

Publisher:

Abstract:

Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.

Relevance:

60.00%

Publisher:

Abstract:

This paper presents an approach for automatic road extraction in an urban region using the structural, spectral and geometric characteristics of roads. Roads are extracted in two stages: pre-processing and road extraction. Initially, the image is pre-processed to improve the tolerance by reducing the clutter (which mostly represents buildings, parking lots, vegetation regions and other open spaces). The road segments are then extracted using Texture Progressive Analysis (TPA) and the Normalized cut algorithm. The TPA technique uses binary segmentation based on three levels of texture statistical evaluation to extract road segments, whereas the Normalized cut method is a graph-based method that generates an optimal partition of road segments. The performance (quality measures) of road extraction using TPA and the Normalized cut method is compared. The experimental results show that the Normalized cut method is efficient in extracting road segments in urban regions from high-resolution satellite images.

Relevance:

60.00%

Publisher:

Abstract:

It has been shown in an earlier paper that 1-realizability of a unate function F of up to six variables corresponds to 'compactness' of the plot of F on a Karnaugh map. Here, an algorithm is presented to synthesize on a Karnaugh map a non-threshold function of up to six variables with the minimum number of threshold gates connected in cascade. Incompletely specified functions can also be treated. No resort to inequalities is made and no pre-processing (such as positivizing and ordering) of the given switching function is required.