954 results for Noisy data
Abstract:
In a world where massive amounts of data are recorded on a large scale, we need data mining technologies to gain knowledge from the data in a reasonable time. The Top Down Induction of Decision Trees (TDIDT) algorithm is a very widely used technology to predict the classification of newly recorded data. However, alternative technologies have been derived that often produce better rules but do not scale well on large datasets. One such alternative to TDIDT is the PrismTCS algorithm. PrismTCS performs particularly well on noisy data but does not scale well on large datasets. In this paper we introduce Prism and investigate its scaling behaviour. We describe how we improved the scalability of the serial version of Prism and investigate its limitations. We then describe our work to overcome these limitations by developing a framework to parallelise algorithms of the Prism family and similar algorithms. We also present the scale-up results of a first prototype implementation.
Abstract:
A two-stage linear-in-the-parameter model construction algorithm is proposed aimed at noisy two-class classification problems. The purpose of the first stage is to produce a prefiltered signal that is used as the desired output for the second stage which constructs a sparse linear-in-the-parameter classifier. The prefiltering stage is a two-level process aimed at maximizing a model's generalization capability, in which a new elastic-net model identification algorithm using singular value decomposition is employed at the lower level, and then, two regularization parameters are optimized using a particle-swarm-optimization algorithm at the upper level by minimizing the leave-one-out (LOO) misclassification rate. It is shown that the LOO misclassification rate based on the resultant prefiltered signal can be analytically computed without splitting the data set, and the associated computational cost is minimal due to orthogonality. The second stage of sparse classifier construction is based on orthogonal forward regression with the D-optimality algorithm. Extensive simulations of this approach for noisy data sets illustrate the competitiveness of this approach to classification of noisy data problems.
Abstract:
Ensemble learning can be used to increase the overall classification accuracy of a classifier by generating multiple base classifiers and combining their classification results. A frequently used family of base classifiers for ensemble learning are decision trees. However, alternative approaches can potentially be used, such as the Prism family of algorithms that also induces classification rules. Compared with decision trees, Prism algorithms generate modular classification rules that cannot necessarily be represented in the form of a decision tree. Prism algorithms produce a similar classification accuracy compared with decision trees. However, in some cases, for example, if there is noise in the training and test data, Prism algorithms can outperform decision trees by achieving a higher classification accuracy. However, Prism still tends to overfit on noisy data; hence, ensemble learners have been adopted in this work to reduce the overfitting. This paper describes the development of an ensemble learner using a member of the Prism family as the base classifier to reduce the overfitting of Prism algorithms on noisy datasets. The developed ensemble classifier is compared with a stand-alone Prism classifier in terms of classification accuracy and resistance to noise.
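The idea of combining the results of several base classifiers can be illustrated with a plain majority vote. This is only a minimal sketch: the abstract does not state which combination rule the authors use, and the function name `majority_vote` and the sample labels are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label list by majority vote.

    `predictions` holds one list of predicted labels per base classifier;
    all lists must have the same length. Ties go to the label that appears
    first among the votes for that instance.
    """
    if not predictions:
        raise ValueError("at least one classifier's predictions are required")
    n = len(predictions[0])
    if any(len(p) != n for p in predictions):
        raise ValueError("all prediction lists must have the same length")
    combined = []
    for votes in zip(*predictions):  # votes cast for a single instance
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three hypothetical base classifiers voting on three instances.
base_outputs = [
    ["a", "b", "a"],
    ["a", "a", "b"],
    ["b", "a", "a"],
]
print(majority_vote(base_outputs))
```

Each base classifier is wrong on a different instance here, so the vote recovers a consistent label for all three, which is the intuition behind using ensembles to dampen overfitting to noise.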
Abstract:
A novel two-stage construction algorithm for linear-in-the-parameters classifier is proposed, aiming at noisy two-class classification problems. The purpose of the first stage is to produce a prefiltered signal that is used as the desired output for the second stage to construct a sparse linear-in-the-parameters classifier. For the first stage learning of generating the prefiltered signal, a two-level algorithm is introduced to maximise the model's generalisation capability, in which an elastic net model identification algorithm using singular value decomposition is employed at the lower level while the two regularisation parameters are selected by maximising the Bayesian evidence using a particle swarm optimization algorithm. Analysis is provided to demonstrate how “Occam's razor” is embodied in this approach. The second stage of sparse classifier construction is based on an orthogonal forward regression with the D-optimality algorithm. Extensive experimental results demonstrate that the proposed approach is effective and yields competitive results for noisy data sets.
Abstract:
We develop an orthogonal forward selection (OFS) approach to construct radial basis function (RBF) network classifiers for two-class problems. Our approach integrates several concepts in probabilistic modelling, including cross validation, mutual information and Bayesian hyperparameter fitting. At each stage of the OFS procedure, one model term is selected by maximising the leave-one-out mutual information (LOOMI) between the classifier's predicted class labels and the true class labels. We derive the formula of LOOMI within the OFS framework so that the LOOMI can be evaluated efficiently for model term selection. Furthermore, a Bayesian procedure of hyperparameter fitting is also integrated into each stage of the OFS to infer the l2-norm based local regularisation parameter from the data. Since each forward stage is effectively the fitting of a one-variable model, this task is very fast. The classifier construction procedure is automatically terminated without the need for an additional stopping criterion, yielding very sparse RBF classifiers with excellent classification generalisation performance, which is particularly useful for noisy data sets with highly overlapping class distributions. A number of benchmark examples are employed to demonstrate the effectiveness of our proposed approach.
Abstract:
In this paper we present a novel approach for multispectral image contextual classification by combining iterative combinatorial optimization algorithms. The pixel-wise decision rule is defined using a Bayesian approach to combine two MRF models: a Gaussian Markov Random Field (GMRF) for the observations (likelihood) and a Potts model for the a priori knowledge, to regularize the solution in the presence of noisy data. Hence, the classification problem is stated according to a Maximum a Posteriori (MAP) framework. In order to approximate the MAP solution we apply several combinatorial optimization methods using multiple simultaneous initializations, making the solution less sensitive to the initial conditions and reducing both computational cost and time in comparison to Simulated Annealing, which is often unfeasible in many real image processing applications. Markov Random Field model parameters are estimated by the Maximum Pseudo-Likelihood (MPL) approach, avoiding manual adjustments in the choice of the regularization parameters. Asymptotic evaluations assess the accuracy of the proposed parameter estimation procedure. To test and evaluate the proposed classification method, we adopt metrics for quantitative performance assessment (Cohen's Kappa coefficient), allowing a robust and accurate statistical analysis. The obtained results clearly show that combining sub-optimal contextual algorithms significantly improves the classification performance, indicating the effectiveness of the proposed methodology. (C) 2010 Elsevier B.V. All rights reserved.
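Cohen's Kappa, the metric named above, compares the observed agreement p_o between predicted and reference labels with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below shows the standard definition; the function name `cohens_kappa` and the toy labels are assumptions for illustration, not taken from the paper.

```python
from collections import Counter


def cohens_kappa(labels_true, labels_pred):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_true) != len(labels_pred) or len(labels_true) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_true)
    # Observed agreement: fraction of positions where the labels coincide.
    p_o = sum(t == p for t, p in zip(labels_true, labels_pred)) / n
    # Chance agreement: product of the marginal class frequencies, summed.
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both sequences use a single class
    return (p_o - p_e) / (1.0 - p_e)


# Toy pixel labels: a perfect and a chance-level classification.
reference = [0, 0, 1, 1]
print(cohens_kappa(reference, [0, 0, 1, 1]))
print(cohens_kappa(reference, [0, 1, 0, 1]))
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which is why kappa is preferred over raw accuracy when class proportions are unbalanced.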
Abstract:
Traditional applications of feature selection in areas such as data mining, machine learning and pattern recognition aim to improve the accuracy and to reduce the computational cost of the model. This is done through the removal of redundant, irrelevant or noisy data, finding a representative subset of data that reduces its dimensionality without loss of performance. With the development of research on ensembles of classifiers, and the verification that this type of model outperforms individual models when the base classifiers are diverse, a new field of application for feature selection research has emerged. In this new field, the goal is to find diverse subsets of features for the construction of base classifiers for the ensemble systems. This work proposes an approach that maximizes the diversity of the ensembles by selecting subsets of features using a model independent of the learning algorithm and with low computational cost. This is done using bio-inspired metaheuristics with filter-based evaluation criteria.
Abstract:
In this work, the factorization method for detecting inhomogeneities of the conductivity in electrical impedance tomography is studied on unbounded domains, specifically the half-plane and the half-space. As solution spaces for the direct problem, i.e. the determination of the electric potential for a given conductivity and a given boundary current, we introduce weighted Sobolev spaces. In these spaces we show the existence of weak solutions of the direct problem and prove the validity of an integral representation for the solution of the Laplace equation, which is obtained for homogeneous conductivity. Using the factorization method, we give an explicit characterization of inclusions whose conductivity jumps to a higher or lower value relative to the background. This simultaneously shows, for this class of conductivities, that the inclusions can be uniquely reconstructed from knowledge of the local Neumann-Dirichlet map. We have turned the characterization of the inclusions obtained by the factorization method into a numerical scheme and tested it in both the two- and three-dimensional cases with simulated, partly perturbed data. In contrast to other known reconstruction methods, the one presented here requires no a priori information about the number and shape of the inclusions and, being non-iterative, has a comparatively low computational cost.
Abstract:
We consider the problem of fitting a union of subspaces to a collection of data points drawn from one or more subspaces and corrupted by noise and/or gross errors. We pose this problem as a non-convex optimization problem, where the goal is to decompose the corrupted data matrix as the sum of a clean and self-expressive dictionary plus a matrix of noise and/or gross errors. By self-expressive we mean a dictionary whose atoms can be expressed as linear combinations of themselves with low-rank coefficients. In the case of noisy data, our key contribution is to show that this non-convex matrix decomposition problem can be solved in closed form from the SVD of the noisy data matrix. The solution involves a novel polynomial thresholding operator on the singular values of the data matrix, which requires minimal shrinkage. For one subspace, a particular case of our framework leads to classical PCA, which requires no shrinkage. For multiple subspaces, the low-rank coefficients obtained by our framework can be used to construct a data affinity matrix from which the clustering of the data according to the subspaces can be obtained by spectral clustering. In the case of data corrupted by gross errors, we solve the problem using an alternating minimization approach, which combines our polynomial thresholding operator with the more traditional shrinkage-thresholding operator. Experiments on motion segmentation and face clustering show that our framework performs on par with state-of-the-art techniques at a reduced computational cost.
Abstract:
This paper proposes a method for the identification of different partial discharges (PDs) sources through the analysis of a collection of PD signals acquired with a PD measurement system. This method, robust and sensitive enough to cope with noisy data and external interferences, combines the characterization of each signal from the collection, with a clustering procedure, the CLARA algorithm. Several features are proposed for the characterization of the signals, being the wavelet variances, the frequency estimated with the Prony method, and the energy, the most relevant for the performance of the clustering procedure. The result of the unsupervised classification is a set of clusters each containing those signals which are more similar to each other than to those in other clusters. The analysis of the classification results permits both the identification of different PD sources and the discrimination between original PD signals, reflections, noise and external interferences. The methods and graphical tools detailed in this paper have been coded and published as a contributed package of the R environment under a GNU/GPL license.
Abstract:
Deterministic chaos has been implicated in numerous natural and man-made complex phenomena ranging from quantum to astronomical scales and in disciplines as diverse as meteorology, physiology, ecology, and economics. However, the lack of a definitive test of chaos vs. random noise in experimental time series has led to considerable controversy in many fields. Here we propose a numerical titration procedure as a simple “litmus test” for highly sensitive, specific, and robust detection of chaos in short noisy data without the need for intensive surrogate data testing. We show that the controlled addition of white or colored noise to a signal with a preexisting noise floor results in a titration index that: (i) faithfully tracks the onset of deterministic chaos in all standard bifurcation routes to chaos; and (ii) gives a relative measure of chaos intensity. Such reliable detection and quantification of chaos under severe conditions of relatively low signal-to-noise ratio is of great interest, as it may open potential practical ways of identifying, forecasting, and controlling complex behaviors in a wide variety of physical, biomedical, and socioeconomic systems.
Abstract:
Many applications, including object reconstruction, robot guidance, and scene mapping, require the registration of multiple views of a scene to generate a complete geometric and appearance model of it. In real situations, transformations between views are unknown and it is necessary to apply expert inference to estimate them. In the last few years, the emergence of low-cost depth-sensing cameras has strengthened the research on this topic, motivating a plethora of new applications. Although they have enough resolution and accuracy for many applications, some situations may not be solved with general state-of-the-art registration methods due to the signal-to-noise ratio (SNR) and the resolution of the data provided. The problem of working with low SNR data may, in general terms, appear in any 3D system, so novel solutions are needed. In this paper, we propose a method, μ-MAR, that performs both coarse and fine registration of sets of 3D points provided by low-cost depth-sensing cameras into a common coordinate system, although it is not restricted to these sensors. The method overcomes the noisy data problem by means of a model-based solution of multiplane registration. Specifically, it iteratively registers 3D markers composed of multiple planes extracted from points of multiple views of the scene. As the markers and the object of interest are static in the scenario, the transformations obtained for the markers are applied to the object in order to reconstruct it. Experiments have been performed using synthetic and real data. The synthetic data allow a qualitative and quantitative evaluation by means of visual inspection and Hausdorff distance, respectively. The real data experiments show the performance of the proposal using data acquired by a Primesense Carmine RGB-D sensor. The method has been compared to several state-of-the-art methods. The results show the good performance of μ-MAR in registering objects with high accuracy in the presence of noisy data, outperforming the existing methods.
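The Hausdorff distance used for the quantitative evaluation measures the largest distance from any point of one point set to its nearest neighbour in the other set, taken in both directions. The sketch below is a brute-force version of that standard definition; the function names and the toy point sets are hypothetical, and the paper's actual evaluation pipeline is not reproduced here.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point in `a` to its nearest neighbour in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff_distance(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy example: the second cloud has one outlying point 2 units away.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
registered = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(hausdorff_distance(reference, registered))  # prints 2.0
```

Because the measure is driven by the single worst-matched point, it is sensitive to outliers, and the brute-force search costs time proportional to the product of the two set sizes, so large point clouds would normally call for a spatial index.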
Abstract:
Adolescent idiopathic scoliosis (AIS) is a three-dimensional (3D) deformity of the spine. For most patients with AIS, no surgical treatment is necessary. When the deformity becomes severe, surgical treatment aimed at reducing the deformity is recommended. To determine the severity of AIS, the most commonly used imaging is a posteroanterior (PA) or anteroposterior (AP) radiograph of the spine. Several indices are available from this imaging modality to quantify the AIS deformity, including the Cobb angle. Therapeutic management is generally based on this index. However, the indices available from this imaging modality are two-dimensional (2D) in nature. They therefore do not fully describe the deformity in AIS, given its three-dimensional (3D) nature. Consequently, classifications based on 2D indices suffer from the same limitations. In order to describe AIS in 3D, geometric torsion was studied and proposed by Poncet et al. It measures the tendency of a three-dimensional curve to change direction. However, the proposed method is susceptible to 3D reconstruction errors and is computed locally at the vertebral level. The objective of this study is to evaluate a new method for estimating geometric torsion by approximating local arc lengths and by curve parametrization in AIS. A first study investigates the sensitivity of the new method to 3D reconstruction errors of the spine. Subsequently, two clinical studies present geometric torsion as a global index and aim to demonstrate the existence of subgroups not identified in current classifications, and that these subgroups are clinically relevant.
The first study evaluated the robustness of the new geometric torsion estimation method in a group of patients with AIS. It demonstrated that the new technique is robust to 3D reconstruction errors of the spine. The second study evaluated geometric torsion using this new method in a cohort of patients with Lenke type 1 deformities. It demonstrated that there are two subgroups, one with high torsion values and the other with low values. These two subgroups show statistically significant differences, notably in the lumbar spine, with the high-torsion group having higher values of orientation of the planes of maximum deformity (PMC) in the thoracolumbar region (TLL). The last study evaluated the surgical outcomes of patients with a Lenke 1 deformity previously sub-classified according to torsion values. This study demonstrated differences in the PMC at the thoracolumbar level, with higher postoperative values in patients with high torsion. These studies present a new method for estimating geometric torsion and present this index quantitatively. They demonstrated the existence of 3D subgroups based on this index that are clinically relevant in AIS and had not previously been identified. This project contributes to the current trend toward the development of 3D indices and 3D classifications for adolescent idiopathic scoliosis.
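For reference, the geometric torsion of a 3D curve has a standard differential-geometry definition, given below for a sufficiently smooth parametrized curve $\mathbf{r}(t)$. The symbols are assumptions introduced for illustration; the estimation scheme evaluated in the thesis (local arc-length approximation and curve parametrization) is not reproduced here.

```latex
% Torsion of a regular space curve r(t), defined wherever the curvature is nonzero.
\tau(t) \;=\;
\frac{\bigl(\mathbf{r}'(t) \times \mathbf{r}''(t)\bigr) \cdot \mathbf{r}'''(t)}
     {\bigl\lVert \mathbf{r}'(t) \times \mathbf{r}''(t) \bigr\rVert^{2}},
\qquad
\kappa(t) \;=\;
\frac{\bigl\lVert \mathbf{r}'(t) \times \mathbf{r}''(t) \bigr\rVert}
     {\bigl\lVert \mathbf{r}'(t) \bigr\rVert^{3}} \;\neq\; 0 .
```

Since the formula involves third derivatives of the curve, small errors in the reconstructed 3D positions are strongly amplified, which is consistent with the sensitivity to reconstruction errors that motivates the new estimation method.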
Abstract:
We investigate the problem of determining the stationary temperature field on an inclusion from given Cauchy data on an accessible exterior boundary. On this accessible part the temperature (or the heat flux) is known, and, additionally, on a portion of this exterior boundary the heat flux (or temperature) is also given. We propose a direct boundary integral approach in combination with Tikhonov regularization for the stable determination of the temperature and flux on the inclusion. To determine these quantities on the inclusion, boundary integral equations are derived using Green's functions, and properties of these equations are shown in an L2-setting. An effective way of discretizing these boundary integral equations, based on the Nyström method and trigonometric approximations, is outlined. Numerical examples are included, both with exact and noisy data, showing that accurate approximations can be obtained with small computational effort, and that the accuracy increases with the length of the portion of the boundary where the additional data is given.
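As a reminder of what Tikhonov regularization does, consider a generic discretized linear system $A x = b^{\delta}$ with noisy right-hand side $b^{\delta}$. The formulas below are the standard finite-dimensional form, with all symbols assumed for illustration; they are not the specific boundary integral equations of the paper.

```latex
% Tikhonov-regularized solution with regularization parameter alpha > 0.
x_{\alpha}
  \;=\; \arg\min_{x} \Bigl( \lVert A x - b^{\delta} \rVert_{2}^{2}
        + \alpha \lVert x \rVert_{2}^{2} \Bigr)
  \;=\; \bigl( A^{\mathsf{T}} A + \alpha I \bigr)^{-1} A^{\mathsf{T}} b^{\delta}.

% In terms of the singular value decomposition A = \sum_i \sigma_i u_i v_i^T:
x_{\alpha}
  \;=\; \sum_{i} \frac{\sigma_{i}}{\sigma_{i}^{2} + \alpha}\,
        \bigl( u_{i}^{\mathsf{T}} b^{\delta} \bigr)\, v_{i}.
```

The factor $\sigma_i / (\sigma_i^2 + \alpha)$ stays bounded as $\sigma_i \to 0$, whereas the unregularized factor $1/\sigma_i$ blows up, so noise components along small singular values are damped rather than amplified.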