75 results for document clustering

at Indian Institute of Science - Bangalore - India


Relevance:

60.00%

Publisher:

Abstract:

In this paper, we present a methodology for identifying the best features from a large feature space. In a high-dimensional feature space, nearest neighbor search becomes meaningless, and we see both quality and performance issues with it. Many data mining algorithms rely on nearest neighbor search, so instead of searching over all the features we need to select the relevant ones. We propose feature selection using Non-negative Matrix Factorization (NMF) and apply it to nearest neighbor search. A recent clustering algorithm based on Locally Consistent Concept Factorization (LCCF) achieves better document clustering quality by using the local geometrical and discriminating structure of the data. Using our feature selection method, we show a further improvement in clustering performance.

Relevance:

30.00%

Publisher:

Abstract:

When the document corpus is very large, we often need to reduce the number of features. However, conventional Non-negative Matrix Factorization (NMF) cannot be applied to a billion-by-million matrix, as the matrix may not fit in memory. Here we present a novel Online NMF algorithm. Using Online NMF, we reduce the original high-dimensional space to a low-dimensional space, and then cluster all the documents in the reduced dimension using the k-means algorithm. We show experimentally that good performance can be achieved by processing small subsets of documents. The proposed method outperforms existing algorithms.

Relevance:

20.00%

Publisher:

Abstract:

The work reported here was motivated by a desire to verify the existence of structure, specifically MP-rich clusters induced by sodium bromide (NaBr), in the ternary liquid mixture 3-methylpyridine (MP) + water (W) + NaBr. We present small-angle X-ray scattering (SAXS) measurements in this mixture. These measurements were obtained at room temperature (approximately 298 K) in the one-phase region (below the relevant lower consolute points, T_L) at different values of X (i.e., X = 0.02 - 0.17), where X is the weight fraction of NaBr in the mixture. The cluster-size distribution, estimated on the assumption that the clusters are spherical, shows systematic behaviour in that the peak of the distribution shifts towards larger values of cluster radius as X increases. The largest spatial extent of the clusters (approximately 4.5 nm) is seen at X = 0.17. Data analysis assuming arbitrary shapes and sizes of clusters gives a limiting value of cluster size (approximately 4.5 nm) that is not very sensitive to X. It is suggested that the cluster size determined may not be the same as the usual critical-point fluctuations far removed from the critical point (T_L). The influence of the additional length scale due to clustering is discussed from the standpoint of crossover from Ising to mean-field critical behaviour when moving away from T_L.

Relevance:

20.00%

Publisher:

Abstract:

We propose a robust method for mosaicing of document images using features derived from connected components. Each connected component is described using the Angular Radial Transform (ART). To ensure geometric consistency during feature matching, the ART coefficients of a connected component are augmented with those of its two nearest neighbors. The proposed method addresses two critical issues often encountered in correspondence matching: (i) the stability of features and (ii) robustness against false matches due to multiple instances of characters in a document image. The use of connected components guarantees a stable localization across images. The augmented features ensure successful correspondence matching even in the presence of multiple similar regions within the page. We illustrate the effectiveness of the proposed method on camera-captured document images exhibiting large variations in viewpoint, illumination and scale.

Relevance:

20.00%

Publisher:

Abstract:

In this paper, a multistage evolutionary scheme is proposed for clustering in a large database, such as speech data. This is achieved by clustering a small subset of the entire sample set in each stage and treating the cluster centroids so obtained as samples, together with another subset of samples not considered previously, as input data to the next stage. This is continued until the whole sample set is exhausted. The clustering is accomplished by constructing a fuzzy similarity matrix and using the fuzzy techniques proposed here. The technique is illustrated by an efficient scheme for voiced-unvoiced-silence classification of speech.
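The staged flow described in this abstract (cluster a subset, then feed the resulting centroids plus a fresh subset into the next stage) can be illustrated with a minimal sketch. It rests on several assumptions that are not in the abstract: the fuzzy similarity matrix step is replaced by a plain greedy distance-threshold rule, each centroid carries a weight equal to the number of samples it has absorbed, and the names `cluster_stage`, `multistage_cluster`, `batch_size` and `threshold` are hypothetical.

```python
from math import dist


def cluster_stage(samples, threshold):
    """Greedy threshold clustering over weighted samples (point, weight).

    Stand-in for the fuzzy clustering step, which the abstract does not
    detail. Returns a list of (centroid, weight) pairs.
    """
    centroids = []  # each entry is [point, weight]
    for point, weight in samples:
        # Find the closest centroid built so far, if any.
        best = min(centroids, key=lambda c: dist(c[0], point), default=None)
        if best is not None and dist(best[0], point) <= threshold:
            # Merge: update the centroid as a weighted mean.
            total = best[1] + weight
            best[0] = tuple(
                (b * best[1] + p * weight) / total
                for b, p in zip(best[0], point)
            )
            best[1] = total
        else:
            centroids.append([point, weight])
    return [(c[0], c[1]) for c in centroids]


def multistage_cluster(samples, batch_size, threshold):
    """Process the sample set in stages of batch_size new samples each.

    Each stage clusters the previous stage's centroids together with a
    subset of samples that has not been seen before.
    """
    centroids = []
    for start in range(0, len(samples), batch_size):
        batch = [(p, 1) for p in samples[start:start + batch_size]]
        centroids = cluster_stage(centroids + batch, threshold)
    return centroids
```

Only the current centroids and one batch are held at a time, which is the property that makes such a scheme attractive for large sample sets.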

Relevance:

20.00%

Publisher:

Abstract:

In this paper, the notion of conceptual cohesiveness is made precise and used to group objects semantically, based on a knowledge structure called a 'cohesion forest'. A set of axioms is proposed which should be satisfied to make the generated clusters meaningful.

Relevance:

20.00%

Publisher:

Abstract:

A computationally efficient agglomerative clustering algorithm based on multilevel theory is presented. Here, the data set is divided randomly into a number of partitions. The samples of each partition are clustered separately using a hierarchical agglomerative clustering algorithm to form sub-clusters. These are merged at higher levels to get the final classification. This algorithm leads to the same classification as the hierarchical agglomerative clustering algorithm when the clusters are well separated. The advantages of this algorithm are short run time and small storage requirement. It is observed that the savings in storage space and computation time increase nonlinearly with the sample size.
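The partition-then-merge idea in this abstract can be sketched in two levels. This is only an illustration under stated assumptions, not the paper's algorithm: clusters are compared by centroid distance, merging stops when the closest pair is farther apart than a `threshold`, and the function names and parameters are hypothetical.

```python
import random
from math import dist


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))


def agglomerate(clusters, threshold):
    """Repeatedly merge the two clusters with the closest centroids.

    Stops when the closest pair is farther apart than threshold.
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))  # i < j, so index i stays valid
    return clusters


def multilevel_cluster(points, n_partitions, threshold, seed=0):
    """Cluster random partitions separately, then merge the sub-clusters."""
    rng = random.Random(seed)
    shuffled = list(points)
    rng.shuffle(shuffled)
    partitions = [shuffled[i::n_partitions] for i in range(n_partitions)]

    # Level 1: agglomerate each partition on its own.
    sub_clusters = []
    for part in partitions:
        if part:
            sub_clusters.extend(agglomerate([[p] for p in part], threshold))

    # Level 2: agglomerate the sub-clusters from all partitions.
    return agglomerate(sub_clusters, threshold)
```

The pairwise search at level 1 runs over one partition at a time rather than the whole data set, which is where the run-time and storage savings mentioned in the abstract would come from.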

Relevance:

20.00%

Publisher:

Abstract:

The K-means algorithm is a well-known nonhierarchical method for clustering data. The most important limitations of this algorithm are that (1) it gives final clusters on the basis of the cluster centroids or the seed points chosen initially, and (2) it is appropriate only for data sets having fairly isotropic clusters. However, this algorithm has the advantage of low computation and storage requirements. On the other hand, the hierarchical agglomerative clustering algorithm, which can handle nonisotropic (chain-like and concentric) clusters, has high storage and computation requirements. This paper suggests a new method for selecting the initial seed points, so that the K-means algorithm gives the same results for any input data order. This paper also describes a hybrid clustering algorithm, based on the concepts of multilevel theory, which is nonhierarchical at the first level and hierarchical from the second level onwards, to cluster data sets having (i) chain-like clusters and (ii) concentric clusters. It is observed that this hybrid clustering algorithm gives the same results as the hierarchical clustering algorithm, with lower computation and storage requirements.
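The abstract does not say how the seed points are chosen, so the sketch below swaps in a different, well-known rule (farthest-first seeding on a sorted copy of the data) purely to show what order-independent K-means seeding can look like. The function names, the tie-breaking rules and the fixed iteration count are all assumptions.

```python
from math import dist


def deterministic_seeds(points, k):
    """Farthest-first seeding on a sorted, de-duplicated copy of the data.

    Illustrative stand-in: the result depends only on the set of points,
    not on the order in which they were supplied.
    """
    pts = sorted(set(points))
    mean = tuple(sum(c) / len(pts) for c in zip(*pts))
    # First seed: the point closest to the overall mean (ties by value).
    seeds = [min(pts, key=lambda p: (dist(p, mean), p))]
    while len(seeds) < k:
        # Next seed: the point farthest from its nearest existing seed.
        seeds.append(
            max(pts, key=lambda p: (min(dist(p, s) for s in seeds), p))
        )
    return seeds


def kmeans(points, k, iterations=20):
    """Plain K-means with deterministic seeds; returns sorted centers."""
    points = sorted(points)  # fixes the summation order as well
    centers = deterministic_seeds(points, k)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            idx = min(range(k), key=lambda i: (dist(p, centers[i]), i))
            groups[idx].append(p)
        centers = [
            tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return sorted(centers)
```

Because both the seeds and the update order are derived from the sorted data, shuffling the input leaves the returned centers unchanged.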

Relevance:

20.00%

Publisher:

Abstract:

The concept of feature selection in a nonparametric unsupervised learning environment is practically undeveloped because no true measure for the effectiveness of a feature exists in such an environment. The lack of a feature selection phase preceding the clustering process seriously affects the reliability of such learning. New concepts such as significant features, level of significance of features, and immediate neighborhood are introduced, which implicitly meet the need for feature selection in the context of clustering techniques.

Relevance:

20.00%

Publisher:

Abstract:

A new clustering technique, based on the concept of immediate neighbourhood, with a novel capability to self-learn the number of clusters expected in the unsupervised environment, has been developed. The method compares favourably with other clustering schemes based on distance measures, both in terms of conceptual innovations and computational economy. Test implementation of the scheme using C-1 flight line training sample data in a simulated unsupervised mode has brought out the efficacy of the technique. The technique can easily be implemented as a front end to established pattern classification systems with supervised learning capabilities to derive unified learning systems capable of operating in both supervised and unsupervised environments. This makes the technique an attractive proposition in the context of remotely sensed earth resources data analysis, wherein it is essential to have such a unified learning system capability.

Relevance:

20.00%

Publisher:

Abstract:

Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we present a new feature-based approach for mosaicing of camera-captured document images. A novel block-based scheme is employed to ensure that corners can be reliably detected over a wide range of images. A 2-D discrete cosine transform is computed for image blocks defined around each of the detected corners, and a small subset of the coefficients is used as a feature vector. A 2-pass feature matching is performed to establish point correspondences from which the homography relating the input images can be computed. The algorithm is tested on a number of complex document images casually taken with a hand-held camera, yielding convincing results.

Relevance:

20.00%

Publisher:

Abstract:

Skew correction of complex document images is a difficult task. We propose an edge-based connected component approach for robust skew correction of documents with complex layout and content. The algorithm essentially consists of two steps - an 'initialization' step to determine the image orientation from the centroids of the connected components and a 'search' step to find the actual skew of the image. During initialization, we choose two different sets of points regularly spaced across the image, one from left to right and the other from top to bottom. The image orientation is determined from the slope between the two successive nearest neighbors of each of the points in the chosen set. The search step finds successive nearest neighbors that satisfy the parameters obtained in the initialization step. The final skew is determined from the slopes obtained in the 'search' step. Unlike other connected component based methods, the proposed method does not require the binarization step that generally precedes connected component analysis. The method works well for scanned documents with complex layout of any skew, with a precision of 0.5 degrees.

Relevance:

20.00%

Publisher:

Abstract:

The document images that are fed into an Optical Character Recognition system might be skewed. This could be due to improper feeding of the document into the scanner or to a faulty scanner. In this paper, we propose a skew detection and correction method for document images. We make use of the inherent randomness in the horizontal projection profiles of a text block image as the skew of the image varies. The proposed algorithm has proved to be very robust and time efficient. The entire process takes less than a second on a 2.4 GHz Pentium IV PC.
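One way to read "randomness in the horizontal projection profile" is as the Shannon entropy of the row histogram, which is lowest when text lines are horizontal. The sketch below follows that reading. The entropy measure, the exhaustive search over candidate angles, and the names `profile_entropy` and `estimate_skew` are assumptions rather than details taken from the abstract.

```python
from collections import Counter
from math import cos, log2, radians, sin


def profile_entropy(pixels, angle_deg):
    """Entropy of the horizontal projection profile after rotation.

    pixels is an iterable of (x, y) foreground coordinates. The points are
    rotated by angle_deg and binned by their rounded row coordinate.
    """
    a = radians(angle_deg)
    rows = Counter(round(x * sin(a) + y * cos(a)) for x, y in pixels)
    total = sum(rows.values())
    return -sum((n / total) * log2(n / total) for n in rows.values())


def estimate_skew(pixels, candidates):
    """Return the candidate rotation (degrees) with the least random profile.

    Applying this rotation to the image would straighten the text lines.
    """
    pixels = list(pixels)
    return min(candidates, key=lambda ang: profile_entropy(pixels, ang))
```

For a page whose text lines have been rotated by +5 degrees, the rotation returned for integer candidates between -10 and 10 would be -5, since that is the angle at which every text line collapses back onto a single row of the profile.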