5 results for Semi-supervised clustering

at Massachusetts Institute of Technology


Relevance:

80.00%

Publisher:

Abstract:

An increasing number of parameter estimation tasks involve the use of at least two information sources, one complete but limited, the other abundant but incomplete. Standard algorithms such as EM (or em) used in this context are unfortunately not stable in the sense that they can lead to a dramatic loss of accuracy with the inclusion of incomplete observations. We provide a more controlled solution to this problem through differential equations that govern the evolution of locally optimal solutions (fixed points) as a function of the source weighting. This approach permits us to explicitly identify any critical (bifurcation) points leading to choices unsupported by the available complete data. The approach readily applies to any graphical model in O(n^3) time where n is the number of parameters. We use the naive Bayes model to illustrate these ideas and demonstrate the effectiveness of our approach in the context of text classification problems.
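The source-weighting idea in this abstract can be illustrated with a small sketch. It is not the differential-equation continuation the authors describe; it is a cruder stand-in that reruns EM on a grid of weights, warm-starting each run from the previous fixed point. The function names (`weighted_em`, `trace_path`), the two-class Bernoulli naive Bayes setup, the weight `lam`, and the smoothing constant `alpha` are all assumptions made for illustration.

```python
import math

def weighted_em(labeled, unlabeled, lam, init=None, iters=200, alpha=1e-3):
    """EM for a 2-class Bernoulli naive Bayes model (hypothetical setup).

    labeled:   list of (features, cls), features a tuple of 0/1, cls in {0, 1}
    unlabeled: list of feature tuples
    lam:       weight on the unlabeled source; (1 - lam) goes to the labeled one
    """
    d = len(labeled[0][0])
    w_lab = (1.0 - lam) / len(labeled)   # per-example weight, labeled source
    w_unl = lam / len(unlabeled)         # per-example weight, unlabeled source
    if init is None:
        prior, theta = [0.5, 0.5], [[0.5] * d for _ in range(2)]
    else:
        prior, theta = init
    for _ in range(iters):
        n = [0.0, 0.0]                       # weighted class counts
        m = [[0.0] * d for _ in range(2)]    # weighted feature-on counts
        for x, c in labeled:
            n[c] += w_lab
            for j in range(d):
                m[c][j] += w_lab * x[j]
        for x in unlabeled:
            # E-step: class posterior under the current parameters
            logp = [
                math.log(prior[c])
                + sum(math.log(theta[c][j] if x[j] else 1.0 - theta[c][j])
                      for j in range(d))
                for c in range(2)
            ]
            top = max(logp)
            post = [math.exp(v - top) for v in logp]
            z = sum(post)
            for c in range(2):
                r = w_unl * post[c] / z
                n[c] += r
                for j in range(d):
                    m[c][j] += r * x[j]
        # M-step with light smoothing so parameters stay strictly inside (0, 1)
        total = n[0] + n[1] + 2 * alpha
        prior = [(n[c] + alpha) / total for c in range(2)]
        theta = [[(m[c][j] + alpha) / (n[c] + 2 * alpha) for j in range(d)]
                 for c in range(2)]
    return prior, theta

def trace_path(labeled, unlabeled, lams):
    """Follow the EM fixed point as the source weight moves along `lams`."""
    path, state = [], None
    for lam in lams:
        state = weighted_em(labeled, unlabeled, lam, init=state)
        path.append((lam, state))
    return path
```

At `lam = 0` the result is just the smoothed estimate from the labeled data. A sudden jump in the parameters between neighbouring grid values would be a rough hint of the kind of critical point the abstract mentions, although a grid search can miss or blur such points.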

Relevance:

80.00%

Publisher:

Abstract:

Co-training is a semi-supervised learning method designed to take advantage of the redundancy that is present when the object to be identified has multiple descriptions. Co-training is known to work well when the multiple descriptions are conditionally independent given the class of the object. The presence of multiple descriptions of objects in the form of text, images, audio and video in multimedia applications appears to provide redundancy in a form that may be suitable for co-training. In this paper, we investigate the suitability of utilizing text and image data from the Web for co-training. We perform measurements to find indications of conditional independence in the texts and images obtained from the Web. Our measurements suggest that conditional independence is likely to be present in the data. Our experiments, conducted within a relevance feedback framework to test whether a method that exploits conditional independence outperforms methods that do not, also indicate that better performance can indeed be obtained by designing algorithms that exploit this form of redundancy when it is present.
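One way to look for indications of conditional independence between two descriptions is to estimate their conditional mutual information given the class; a value near zero is consistent with independence. The abstract does not say which measurement the authors used, so the sketch below is an assumption, and the function name and the discrete `(x, y, c)` encoding are hypothetical.

```python
import math
from collections import Counter

def conditional_mutual_information(samples):
    """Plug-in estimate of I(X; Y | C) in nats.

    samples: list of (x, y, c) tuples of hashable discrete values, where x and
    y stand for the two descriptions (e.g. a text feature and an image
    feature) and c is the class label.
    """
    n = len(samples)
    xyc = Counter(samples)
    xc = Counter((x, c) for x, _, c in samples)
    yc = Counter((y, c) for _, y, c in samples)
    cc = Counter(c for _, _, c in samples)
    total = 0.0
    for (x, y, c), k in xyc.items():
        # p(x,y,c) * log[ p(x,y,c) p(c) / (p(x,c) p(y,c)) ], with counts
        total += (k / n) * math.log(k * cc[c] / (xc[(x, c)] * yc[(y, c)]))
    return total
```

The plug-in estimate is biased upward on small samples, so in practice a nonzero value would need to be compared against a baseline, for example the same statistic after shuffling `y` within each class.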

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we present an approach to discretizing multivariate continuous data while learning the structure of a graphical model. We derive the joint scoring function from the principle of predictive accuracy, which inherently ensures the optimal trade-off between goodness of fit and model complexity (including the number of discretization levels). Using the so-called finest grid implied by the data, our scoring function depends only on the number of data points in the various discretization levels. Not only can it be computed efficiently, but it is also independent of the metric used in the continuous space. Our experiments with gene expression data show that discretization plays a crucial role regarding the resulting network structure.

Relevance:

20.00%

Publisher:

Abstract:

We describe a system that learns from examples to recognize people in images taken indoors. Images of people are represented by color-based and shape-based features. Recognition is carried out through combinations of Support Vector Machine classifiers (SVMs). Different types of multiclass strategies based on SVMs are explored and compared to k-Nearest Neighbors classifiers (kNNs). The system works in real time and shows high performance rates for people recognition throughout one day.

Relevance:

20.00%

Publisher:

Abstract:

We study four measures of problem instance behavior that might account for the observed differences in interior-point method (IPM) iterations when these methods are used to solve semidefinite programming (SDP) problem instances: (i) an aggregate geometry measure related to the primal and dual feasible regions (aspect ratios) and norms of the optimal solutions, (ii) the (Renegar-) condition measure C(d) of the data instance, (iii) a measure of the near-absence of strict complementarity of the optimal solution, and (iv) the level of degeneracy of the optimal solution. We compute these measures for the SDPLIB suite problem instances and measure the correlation between these measures and IPM iteration counts (solved using the software SDPT3) when the measures have finite values. Our conclusions are roughly as follows: the aggregate geometry measure is highly correlated with IPM iterations (CORR = 0.896), and is a very good predictor of IPM iterations, particularly for problem instances with solutions of small norm and aspect ratio. The condition measure C(d) is also correlated with IPM iterations, but less so than the aggregate geometry measure (CORR = 0.630). The near-absence of strict complementarity is weakly correlated with IPM iterations (CORR = 0.423). The level of degeneracy of the optimal solution is essentially uncorrelated with IPM iterations.
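The correlation analysis described here, restricted to instances where a measure is finite, can be sketched as follows. This is a minimal illustration that assumes the reported CORR values are sample Pearson correlations, which the abstract does not state; the function name `pearson_finite` is hypothetical.

```python
import math

def pearson_finite(measure, iterations):
    """Pearson correlation between a per-instance measure and iteration
    counts, skipping instances where either value is not finite."""
    pairs = [(m, it) for m, it in zip(measure, iterations)
             if math.isfinite(m) and math.isfinite(it)]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two finite pairs")
    mean_m = sum(m for m, _ in pairs) / n
    mean_it = sum(it for _, it in pairs) / n
    sxy = sum((m - mean_m) * (it - mean_it) for m, it in pairs)
    sxx = sum((m - mean_m) ** 2 for m, _ in pairs)
    syy = sum((it - mean_it) ** 2 for _, it in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation undefined for a constant series")
    return sxy / math.sqrt(sxx * syy)
```

Dropping the non-finite instances changes the sample for each measure, so correlations computed this way for different measures are not necessarily based on the same set of problem instances.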