875 results for document clustering


Relevance:

20.00%

Publisher:

Abstract:

Partitional clustering algorithms, which partition the dataset into a pre-defined number of clusters, can be broadly classified into two types: algorithms which explicitly take the number of clusters as input and algorithms that take the expected size of a cluster as input. In this paper, we propose a variant of the k-means algorithm and prove that it is more efficient than standard k-means algorithms. An important contribution of this paper is the establishment of a relation between the number of clusters and the size of the clusters in a dataset through the analysis of our algorithm. We also demonstrate that the integration of this algorithm as a pre-processing step in classification algorithms reduces their running-time complexity.

Relevance:

20.00%

Publisher:

Abstract:

In this paper, we present a new feature-based approach for mosaicing of camera-captured document images. A novel block-based scheme is employed to ensure that corners can be reliably detected over a wide range of images. A 2-D discrete cosine transform (DCT) is computed for image blocks defined around each of the detected corners, and a small subset of the coefficients is used as a feature vector. A 2-pass feature matching is performed to establish point correspondences, from which the homography relating the input images can be computed. The algorithm is tested on a number of complex document images casually taken with a hand-held camera, yielding convincing results.
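The DCT-based descriptor described in this abstract can be illustrated with a minimal sketch. The abstract does not state the block size, which coefficients are kept, or how corners are detected, so the 8x8 block (`half=4`), the 3x3 low-frequency subset (`keep=3`), and the function names `dct2` and `corner_feature` are all assumptions made for illustration. The sketch also assumes the corner lies at least `half` pixels from the image border.

```python
import math

def dct2(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(u):
        # Normalisation factor of the orthonormal DCT-II.
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    # 1-D cosine basis: basis[u][x] = cos((2x + 1) * u * pi / (2n)).
    basis = [[math.cos((2 * x + 1) * u * math.pi / (2 * n)) for x in range(n)]
             for u in range(n)]
    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += block[x][y] * basis[u][x] * basis[v][y]
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs

def corner_feature(image, row, col, half=4, keep=3):
    """Feature vector for the block around (row, col).

    Keeps the keep x keep lowest-frequency coefficients (an assumed choice;
    the abstract only says "a small subset of the coefficients").
    """
    block = [r[col - half:col + half] for r in image[row - half:row + half]]
    coeffs = dct2(block)
    return [coeffs[u][v] for u in range(keep) for v in range(keep)]
```

For a block of constant intensity, only the first (DC) entry of the feature vector is non-zero, which makes a convenient sanity check before matching features between two images.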

Relevance:

20.00%

Publisher:

Abstract:

Skew correction of complex document images is a difficult task. We propose an edge-based connected component approach for robust skew correction of documents with complex layout and content. The algorithm essentially consists of two steps: an 'initialization' step to determine the image orientation from the centroids of the connected components, and a 'search' step to find the actual skew of the image. During initialization, we choose two different sets of points regularly spaced across the image, one from left to right and the other from top to bottom. The image orientation is determined from the slope between the two successive nearest neighbors of each of the points in the chosen set. The search step finds successive nearest neighbors that satisfy the parameters obtained in the initialization step. The final skew is determined from the slopes obtained in the 'search' step. Unlike other connected component based methods, the proposed method does not require the binarization step that generally precedes connected component analysis. The method works well for scanned documents with complex layout of any skew, with a precision of 0.5 degrees.

Relevance:

20.00%

Publisher:

Abstract:

The document images fed into an optical character recognition (OCR) system may be skewed, either because the document was fed into the scanner improperly or because the scanner is faulty. In this paper, we propose a skew detection and correction method for document images. We make use of the inherent randomness in the horizontal projection profiles of a text block image as the skew of the image varies. The proposed algorithm has proved to be very robust and time efficient. The entire process takes less than a second on a 2.4 GHz Pentium IV PC.
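One way to read "randomness of the horizontal projection profile" is as the entropy of the row histogram of text pixels: when the text lines are horizontal, the pixels concentrate in a few rows and the entropy is low. The sketch below follows that reading. It is an assumption, not the authors' stated measure, and the function names, the ±10 degree search range, and the 0.5 degree step are hypothetical choices.

```python
import math
from collections import Counter

def profile_entropy(points, angle_deg):
    """Entropy of the horizontal projection profile after undoing angle_deg.

    points is a list of (x, y) coordinates of foreground (text) pixels.
    """
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    # Row index of every point after rotating the page back by angle_deg.
    rows = Counter(round(-x * sin_a + y * cos_a) for x, y in points)
    total = sum(rows.values())
    return -sum((c / total) * math.log(c / total) for c in rows.values())

def estimate_skew(points, max_angle=10.0, step=0.5):
    """Return the candidate angle whose de-skewed profile has minimum entropy."""
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return min(candidates, key=lambda ang: profile_entropy(points, ang))
```

As a usage example, synthetic "text lines" rotated by a known angle can be passed to `estimate_skew`; the returned angle is the rotation to undo. An exhaustive scan like this is simple but not necessarily fast, so it says nothing about the sub-second timing reported in the abstract.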

Relevance:

20.00%

Publisher:

Abstract:

The k-means algorithm is an extremely popular technique for clustering data. One of the major limitations of k-means is that the time to cluster a given dataset D is linear in the number of clusters, k. In this paper, we employ height-balanced trees to address this issue. Specifically, we make two major contributions: (a) we propose an algorithm, RACK (an acronym for RApid Clustering using k-means), whose running time compares favorably with the fastest known existing techniques, and (b) we prove an expected bound on the quality of clustering achieved using RACK. Our experimental results on large datasets strongly suggest that RACK is competitive with the k-means algorithm in terms of quality of clustering, while taking significantly less time.
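The linear dependence on k mentioned in this abstract comes from the assignment step of standard k-means, which compares every point against every centroid. The following is a minimal sketch of plain Lloyd-style k-means, not of RACK, whose tree-based details the abstract does not give. The function names and the random initialisation are assumptions made for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Assignment step: n * k distance evaluations, hence linear in k."""
    return [min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (keep it if empty)."""
    dim = len(points[0])
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(m[d] for m in members) / len(members)
                                       for d in range(dim)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means: alternate assignment and update until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialisation: k random points
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels
```

Replacing the linear scan in `assign` with a search structure over the centroids is the general kind of change that would remove the factor of k per point; how RACK does this with height-balanced trees is not specified in the abstract.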

Relevance:

20.00%

Publisher:

Abstract:

The keyword-based search technique suffers from the problem of synonymic and polysemic queries. Current approaches address only the problem of synonymic queries, in which different queries might have the same information requirement. But the problem of polysemic queries, i.e., the same query having different intentions, still remains unaddressed. In this paper, we propose the notion of intent clusters, the members of which will have the same intention. We develop a clustering algorithm that uses the user session information in query logs, in addition to query URL entries, to identify clusters of queries having the same intention. The proposed approach has been studied through case examples from actual AOL log data, and the clustering algorithm is shown to be successful in discerning user intentions.

Relevance:

20.00%

Publisher:

Abstract:

Resistometric studies of isochronal and isothermal annealing of an Al-0.64 at.% Ag alloy have given a value of 0.13 ± 0.02 eV for the silver-vacancy binding energy and 0.55 ± 0.03 eV for the migration energy of solute atoms.

Relevance:

20.00%

Publisher:

Abstract:

The influence of 0.03 and 0.08 at. % Ag additions on the clustering of Zn atoms in an Al-4.4 at. % Zn alloy has been studied by resistometry. The effect of quenching and ageing temperatures shows that the ageing-ratio method of calculating the vacancy-solute atom binding energy is not applicable to these alloys. Zone-formation in Al-Zn is unaffected by Ag additions, but the zone-reversion process seems to be influenced. Apparent vacancy-formation energies in the binary and ternary alloys have been used to evaluate the v-Ag atom binding energy as 0.21 eV. It is proposed that, Ag and Zn being similar in size, the relative vacancy binding results from valency effects, and that in Al-Zn-Ag alloys clusters of Zn and Ag may form simultaneously, unaffected by the presence of each other. © 1970 Chapman and Hall Ltd.

Relevance:

20.00%

Publisher:

Abstract:

Isochronal and isothermal ageing experiments have been carried out to determine the influence of a 0.01 at. % addition of a second solute on the clustering rate in the quenched Al-4.4 a/o Zn alloy. The influence of quenching and ageing temperatures has been interpreted to obtain the apparent vacancy formation and vacancy migration energies in the various ternary alloys. Using a vacancy-aided clustering model, the following values of binding free energy have been evaluated: Ce-0.18; Dy-0.24; Fe-0.18; Li-0.25; Mn-0.27; Nb-0.18; Pt-0.23; Sb-0.21; Si-0.30; Y-0.25; and Yb-0.23 (± 0.02 eV). These binding energy values refer to the binding between a solute atom and a single vacancy. The values of vacancy migration energy (c. 0.4 eV) and the experimental activation energy for solute diffusion (c. 1.1 eV) are unaffected by the presence of the ternary atoms in the Al-Zn alloy.

Relevance:

20.00%

Publisher:

Abstract:

Al-4.4 a/o Zn alloys, and Al-4.4 a/o Zn alloys with Ag, Ce, Dy, Li, Nb, Pt, Y, or Yb additions, have been investigated by resistometry with a view to studying the solute-vacancy interactions and clustering kinetics in these alloys. Solute-vacancy binding energies have been evaluated for all these elements by making use of appropriate methods of evaluation. Ag and Dy additions yield some interesting results, and these have been discussed in the thesis. Solute-vacancy binding energy values obtained here have been compared with other available values and discussed. A study of the type of interaction between vacancies and solute atoms indicates that the valency effect is more predominant than the elastic effect.

Relevance:

20.00%

Publisher:

Abstract:

Electronic document management (EDM) technology has the potential to enhance information management in construction projects considerably, without radical changes to current practice. Over the past fifteen years this topic has been overshadowed by building product modelling in the construction IT research world, but at present EDM is quickly being introduced in practice, in particular in bigger projects. Often this is done in the form of third-party services available over the World Wide Web. In the paper, a typology of research questions and methods is presented, which can be used to position the individual research efforts that are surveyed in the paper. Questions dealt with include: What features should EDM systems have? How much are they used? Are there benefits from use and how should these be measured? What are the barriers to widespread adoption? Which technical questions need to be solved? Is there scope for standardisation? How will the market for such systems evolve?

Relevance:

20.00%

Publisher:

Abstract:

Triggered by the very quick proliferation of Internet connectivity, electronic document management (EDM) systems are now rapidly being adopted for managing the documentation that is produced and exchanged in construction projects. Nevertheless, there are still substantial barriers to the efficient use of such systems, mainly of a psychological nature and related to insufficient training. This paper presents the results of empirical studies carried out during 2002 concerning the current usage of EDM systems in the Finnish construction industry. The studies employed three different methods in order to provide a multifaceted view of the problem area, both on the industry and individual project level. In order to provide an accurate measurement of overall usage volume in the industry as a whole, telephone interviews with key personnel from 100 randomly chosen construction projects were conducted. The interviews showed that while around 1/3 of big projects have already adopted EDM, very few small projects have adopted this technology. The barriers to introduction were investigated through interviews with representatives of half a dozen providers of systems and ASP services. These interviews shed a lot of light on the dynamics of the market for this type of service and illustrated the diversity of business strategies adopted by vendors. In the final study, log files from a project which had used an EDM system were analysed in order to determine usage patterns. The results illustrated that use is still incomplete in coverage and that only some of the individuals involved in the project used the system efficiently, either as information producers or consumers. The study also provided feedback on the usefulness of the log files.