952 results for Text Pre-Processing


Relevance:

100.00% 100.00%

Publisher:

Abstract:

The region of greatest variability on soil maps is along the edges of their polygons, which causes disagreement among pedologists about the appropriate description of soil classes at these locations. The objective of this work was to propose a data pre-processing strategy for digital soil mapping (DSM). Soil polygons on a training map were shrunk by 100 and 160 m. This strategy prevented covariates located near the edges of the soil classes from being used in the Decision Tree (DT) models. Three DT models, derived from eight predictive covariates related to relief and organism factors and sampled on the original polygons of a soil map and on polygons shrunk by 100 and 160 m, were used to predict soil classes. The DT model derived from observations 160 m away from the edge of the polygons on the original map is less complex and has better predictive performance.
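Shrinking a polygon by a distance d keeps exactly the interior points that lie at least d from its boundary, so the effect on training samples can be sketched without a GIS library. The following is a minimal illustration only, not the authors' workflow: the square polygon, the sample coordinates, and the function names are all hypothetical, and coordinates are assumed to be in metres in a projected system.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (2-tuples, metres)."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def distance_to_boundary(p, polygon):
    """Smallest distance from p to any edge of the polygon."""
    n = len(polygon)
    return min(
        point_segment_distance(p, polygon[i], polygon[(i + 1) % n])
        for i in range(n)
    )

def point_in_polygon(p, polygon):
    """Ray-casting test: True if p lies inside the polygon."""
    x, y = p
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def keep_interior_samples(samples, polygon, shrink_m):
    """Keep samples inside the polygon and at least shrink_m from its edge."""
    return [
        p for p in samples
        if point_in_polygon(p, polygon)
        and distance_to_boundary(p, polygon) >= shrink_m
    ]

# Hypothetical 1 km x 1 km soil polygon and sample locations.
soil_polygon = [(0, 0), (1000, 0), (1000, 1000), (0, 1000)]
samples = [(500, 500), (50, 500), (120, 880), (900, 130), (1500, 500)]

kept_100 = keep_interior_samples(samples, soil_polygon, 100)
kept_160 = keep_interior_samples(samples, soil_polygon, 160)
print(len(kept_100), len(kept_160))
```

A larger shrink distance discards more samples, so the purer training set comes at the cost of fewer observations per class.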

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Constructing biodiversity richness maps from Environmental Niche Models (ENMs) of thousands of species is time-consuming. A separate pre-processing phase for species occurrence data enables the experimenter to control the variance in test AUC scores that is due to species dataset size. Besides removing duplicate occurrences and points with missing environmental data, we discuss the need for coordinate precision, wide dispersion, temporal, and synonymity filters. After species data filtering, the final task of a pre-processing phase should be the automatic generation of species occurrence datasets that can then be directly 'plugged in' to the ENM. A software application capable of carrying out all these tasks will be a valuable time-saver, particularly for large-scale biodiversity studies.
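Three of the filters named in the abstract (duplicates, missing environmental data, coordinate precision) can be sketched as follows. This is an illustrative assumption about how such a filter might look, not the application described by the authors: the record layout, the field names, the two-decimal threshold, and the placeholder species name are all hypothetical. Coordinates are kept as text so that their reported precision can be inspected.

```python
def decimal_places(value_text):
    """Count the digits after the decimal point of a coordinate given as text."""
    _, _, fraction = value_text.partition(".")
    return len(fraction)

def clean_occurrences(records, min_decimals=2):
    """Apply missing-data, coordinate-precision and duplicate filters in turn."""
    seen = set()
    cleaned = []
    for rec in records:
        # Drop points with any missing environmental value.
        if any(value is None for value in rec["env"].values()):
            continue
        # Drop points whose coordinates are reported too coarsely.
        if (decimal_places(rec["lat"]) < min_decimals
                or decimal_places(rec["lon"]) < min_decimals):
            continue
        # Drop repeated occurrences of the same species at the same location.
        key = (rec["species"], float(rec["lat"]), float(rec["lon"]))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

# Hypothetical occurrence records for a placeholder species name.
records = [
    {"species": "Aus bus", "lat": "-23.5512", "lon": "-46.6340",
     "env": {"temp": 21.3, "rain": 1450.0}},
    {"species": "Aus bus", "lat": "-23.5512", "lon": "-46.6340",
     "env": {"temp": 21.3, "rain": 1450.0}},   # exact duplicate
    {"species": "Aus bus", "lat": "-23.6", "lon": "-46.7",
     "env": {"temp": 20.9, "rain": 1400.0}},   # coordinates too coarse
    {"species": "Aus bus", "lat": "-22.9035", "lon": "-43.2096",
     "env": {"temp": None, "rain": 1200.0}},   # missing environmental value
    {"species": "Aus bus", "lat": "-19.9167", "lon": "-43.9345",
     "env": {"temp": 22.1, "rain": 1500.0}},
]

cleaned = clean_occurrences(records)
print(len(cleaned))
```

The wide dispersion, temporal, and synonymity filters are not shown because the abstract does not describe how they are defined.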

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Due to the imprecise nature of biological experiments, biological data are often characterized by the presence of redundant and noisy data. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently used frequently produce noisy biological data. Machine Learning algorithms have been successfully used in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. The evaluation analyzes the effectiveness of the investigated techniques in removing noisy data, measured by the accuracy obtained by different Machine Learning classifiers over the pre-processed data.
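The abstract does not name the specific techniques evaluated. As one well-known example of a distance-based noise filter, the sketch below uses Wilson's Edited Nearest Neighbour rule, which removes every instance whose label disagrees with the majority label of its k nearest neighbours. The two-feature toy data and the class names are hypothetical stand-ins for real expression profiles.

```python
import math
from collections import Counter

def edited_nearest_neighbours(X, y, k=3):
    """Keep only instances whose label matches the majority of their k neighbours."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances from instance i to every other instance, nearest first.
        neighbours = sorted(
            (math.dist(xi, xj), yj)
            for j, (xj, yj) in enumerate(zip(X, y))
            if j != i
        )[:k]
        majority_label, _ = Counter(
            label for _, label in neighbours
        ).most_common(1)[0]
        if majority_label == yi:
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y

# Hypothetical toy data: two well-separated groups, one mislabelled instance.
X = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.15),
    (5.0, 5.0), (5.1, 5.2), (5.2, 5.1), (5.15, 5.15),
]
y = [
    "healthy", "healthy", "healthy", "tumour",   # last label is the noise
    "tumour", "tumour", "tumour", "tumour",
]

clean_X, clean_y = edited_nearest_neighbours(X, y, k=3)
print(len(clean_X))
```

A classifier would then be trained on `clean_X` and `clean_y`, and its accuracy compared with one trained on the unfiltered data, which mirrors the evaluation the abstract describes.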

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Today, several different unsupervised classification algorithms are commonly used to cluster similar patterns in a data set based only on its statistical properties. Especially in image data applications, self-organizing methods for unsupervised classification have been successfully applied to cluster pixels or groups of pixels in order to perform segmentation tasks. The first important contribution of this paper is the development of a self-organizing method for data classification, named Enhanced Independent Component Analysis Mixture Model (EICAMM), which was built by proposing some modifications to the Independent Component Analysis Mixture Model (ICAMM). These improvements were proposed by considering some of the model's limitations and by analyzing how it should be improved in order to become more efficient. Moreover, a pre-processing methodology was also proposed, based on combining Sparse Code Shrinkage (SCS) for image denoising with the Sobel edge detector. In the experiments of this work, the EICAMM and other self-organizing models were applied to segment images in their original and pre-processed versions. A comparative analysis showed satisfactory and competitive image segmentation results obtained by the proposals presented herein. © 2008 Published by Elsevier B.V.
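Of the two pre-processing components mentioned, the Sobel edge detector is simple enough to sketch directly. The example below computes the gradient magnitude with the standard 3x3 Sobel kernels on a small synthetic image. The SCS denoising step is omitted, border pixels are left at zero for simplicity, and the test image is hypothetical.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical intensity changes.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]

def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a greyscale image (list of rows)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = gy = 0.0
            for dr in range(3):
                for dc in range(3):
                    pixel = image[r + dr - 1][c + dc - 1]
                    gx += SOBEL_X[dr][dc] * pixel
                    gy += SOBEL_Y[dr][dc] * pixel
            out[r][c] = math.hypot(gx, gy)
    return out

# Hypothetical image: dark left half, bright right half (a vertical edge).
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]

edges = sobel_magnitude(image)
print(edges[2])
```

The response is non-zero only in the two columns adjacent to the intensity step, which is the kind of boundary information a later segmentation stage can use.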

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Arguably, the most difficult task in text classification is to choose an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g. nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approximate, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they do not require any text pre-processing or feature engineering.
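The abstract does not say which information-theoretic measures are used. One widely known candidate is the normalized compression distance (NCD), which approximates shared information through compressed sizes. The sketch below assumes NCD with `zlib` as the compressor; the example texts, labels, and function names are hypothetical, and this should not be read as the authors' method.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two raw, unprocessed texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)

def dissimilarity_representation(text, prototypes):
    """Map a text to a vector of its dissimilarities to each prototype text."""
    return [ncd(text, p) for p in prototypes]

def nearest_neighbour_label(text, labelled_texts):
    """Return the label of the training text with the smallest NCD to `text`."""
    return min(labelled_texts, key=lambda pair: ncd(text, pair[0]))[1]

# Hypothetical labelled reviews, used as both prototypes and training set.
positive = ("A wonderful, moving film with superb acting "
            "and a story that stays with you.")
negative = ("A dull, lifeless film with wooden acting "
            "and a plot that goes nowhere at all.")
training = [(positive, "positive"), (negative, "negative")]

query = "Superb acting and a moving story; a wonderful film."
vector = dissimilarity_representation(query, [positive, negative])
print(vector, nearest_neighbour_label(query, training))
```

No tokenization, stemming, or feature selection is involved: the raw strings go straight to the compressor. The resulting dissimilarity vector could equally be passed to a support vector machine instead of the nearest-neighbour rule.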

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this work we present a simulation of a recognition process that uses the perimeter characterization of simple plant leaves as the only discriminating parameter. Data coding that allows for independence of leaf size and orientation may penalize recognition performance for some varieties. Border description sequences are then used to characterize the leaves. Independent Component Analysis (ICA) is then applied in order to study the best number of components to be considered for the classification task, which is implemented by means of an Artificial Neural Network (ANN). The results obtained with ICA as a pre-processing tool are satisfactory, and compared with some references our system improves the recognition success up to 80.8%, depending on the number of independent components considered.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The Gram-Schmidt (GS) orthogonalisation procedure has been used to improve the convergence speed of least mean square (LMS) adaptive code-division multiple-access (CDMA) detectors. However, this algorithm updates two sets of parameters, namely the GS transform coefficients and the tap weights, simultaneously. Because of the additional adaptation noise introduced by the former, it is impossible to achieve the same performance as the ideal orthogonalised LMS filter, unlike the result implied in an earlier paper. The authors provide a lower bound on the minimum achievable mean squared error (MSE) as a function of the forgetting factor λ used in finding the GS transform coefficients, and propose a variable-λ algorithm to balance the conflicting requirements of good tracking and low misadjustment.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq)

Relevance:

100.00% 100.00%

Publisher:

Abstract:

In this work, an image pre-processing module has been developed to extract quantitative information from plantation images with various degrees of infestation. The module comprises four filters: the first smooths the image, the second removes the image background to enhance the plant leaves, the third removes isolated dots not removed by the previous filter, and the fourth highlights the leaves' edges. The filters were first tested in MATLAB, for quick visual feedback on their behavior. They were then implemented in the C programming language. Finally, the module was coded in VHDL for implementation on a Stratix II family FPGA. Tests were run and the results are shown in this paper. © 2008 Springer-Verlag Berlin Heidelberg.

Relevance:

100.00% 100.00%

Publisher:

Abstract:

Lymphoma is a type of cancer that affects the immune system and is classified as Hodgkin or non-Hodgkin. It is one of the ten most common types of cancer worldwide. Among all malignant neoplasms diagnosed in the world, lymphoma accounts for three to four percent. Our work presents a study of some filters devoted to enhancing images of lymphoma at the pre-processing step. Here the enhancement is useful for removing noise from the digital images. We have analysed the noise caused by different sources, such as room vibration, scraps and defocusing, in the following classes of lymphoma: follicular, mantle cell and B-cell chronic lymphocytic leukemia. The Gaussian, Median and Mean-Shift filters were applied to different colour models (RGB, Lab and HSV). Afterwards, we performed a quantitative analysis of the images by means of the Structural Similarity Index, in order to evaluate the similarity between the images. In all cases we obtained a certainty of at least 75%, which rises to 99% if one considers only HSV. We therefore conclude that HSV is an important choice of colour model when pre-processing histological images of lymphoma, because in this case the resulting image gets the best enhancement.
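The filter-then-compare procedure can be illustrated on a single greyscale channel. The sketch below applies a 3x3 median filter and scores the result with a simplified Structural Similarity Index computed over the whole image as one window. The usual SSIM averages over local windows, so this global variant is an assumption made for brevity, as are the synthetic images and the function names.

```python
from statistics import median

def median_filter_3x3(image):
    """Replace each interior pixel by the median of its 3x3 neighbourhood."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]  # border pixels are copied unchanged
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            window = [
                image[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            out[r][c] = median(window)
    return out

def global_ssim(x, y, dynamic_range=255):
    """Single-window SSIM between two equally sized greyscale images."""
    xs = [p for row in x for p in row]
    ys = [p for row in y for p in row]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    vx = sum((p - mx) ** 2 for p in xs) / n
    vy = sum((q - my) ** 2 for q in ys) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(xs, ys)) / n
    # Stabilising constants from the standard SSIM definition.
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

# Hypothetical flat reference image and a copy with one impulse-noise pixel.
clean = [[50] * 5 for _ in range(5)]
noisy = [row[:] for row in clean]
noisy[2][2] = 255

denoised = median_filter_3x3(noisy)
print(global_ssim(clean, noisy), global_ssim(clean, denoised))
```

For a colour-model comparison like the one in the abstract, the same steps would be repeated per channel after converting the image to RGB, Lab, or HSV.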

Relevance:

100.00% 100.00%

Publisher:

Abstract:

The thesis proposes a middleware solution for scenarios in which sensors produce a large volume of data that must be managed and processed through preprocessing, filtering and buffering operations, in order to improve communication efficiency and bandwidth consumption while respecting energy and computational constraints. These components can be optimized through remote tuning operations.