42 results for Web documents


Relevance:

20.00%

Abstract:

We propose a set of metrics that evaluate the uniformity, sharpness, continuity, noise, stroke width variance, pulse width ratio, transient pixel density, entropy and variance of components to quantify the quality of a document image. The measures are intended to be used in any optical character recognition (OCR) engine to estimate, a priori, the expected performance of the OCR. The suggested measures have been evaluated on many document images in different scripts. The quality of each document image is manually annotated by users to create a ground truth. The idea is to correlate the values of the measures with the user-annotated data: if the calculated measure matches the annotated description, the metric is accepted; otherwise it is rejected. Of the metrics proposed, some are accepted and the rest rejected. We have defined metrics that are easy to estimate. The metrics proposed in this paper are based on feedback from home-grown OCR engines for Indic (Tamil and Kannada) languages. The metrics are independent of the script and depend only on the quality and age of the paper and the printing. Experiments and results for each proposed metric are discussed. Actual recognition of the printed text is not performed to evaluate the proposed metrics. Sometimes a document image containing broken characters is scored as a good document image by the evaluated metrics; this remains an unsolved challenge. The proposed measures work on grayscale document images and fail to provide reliable information on binarized document images.
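
A minimal sketch of two of the simpler measure families named above (gray-level entropy and intensity variance), assuming an 8-bit grayscale image as a NumPy array; the paper's exact formulations are not given in the abstract.

```python
import numpy as np

def grayscale_entropy(img: np.ndarray) -> float:
    """Shannon entropy (bits/pixel) of the gray-level histogram.
    Low entropy often indicates a washed-out or near-blank page."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()
    p = p[p > 0]                     # skip empty bins to avoid log(0)
    return float(-(p * np.log2(p)).sum())

def intensity_variance(img: np.ndarray) -> float:
    """Variance of pixel intensities; low values suggest faded print."""
    return float(img.astype(float).var())

# usage (random pixels standing in for a scanned page):
page = np.random.randint(0, 256, size=(1000, 700), dtype=np.uint8)
print(grayscale_entropy(page), intensity_variance(page))
```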

Relevance:

20.00%

Abstract:

When a document corpus is very large, we often need to reduce the number of features. But it is not possible to apply conventional non-negative matrix factorization (NMF) to a billion-by-million matrix, as the matrix may not fit in memory. Here we present a novel online NMF algorithm. Using online NMF, we reduce the original high-dimensional space to a low-dimensional space. We then cluster all the documents in the reduced dimension using the k-means algorithm. We show experimentally that, by processing small subsets of documents, we are able to achieve good performance. The proposed method outperforms existing algorithms.
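
A minimal sketch of the pipeline described above, using scikit-learn's MiniBatchNMF (available since scikit-learn 1.1) as a stand-in for the paper's online NMF; the authors' update rules may differ, and the corpus and parameter values here are only illustrative.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import MiniBatchNMF   # mini-batch stand-in for online NMF
from sklearn.cluster import KMeans

# illustrative corpus; the paper targets corpora too large to factorize in memory
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data

# sparse documents-by-terms matrix (non-negative, as NMF requires)
X = TfidfVectorizer(max_features=50_000, stop_words="english").fit_transform(docs)

# factorize in small batches, then cluster in the reduced space
nmf = MiniBatchNMF(n_components=50, batch_size=1024, random_state=0)
W = nmf.fit_transform(X)                          # low-dimensional representation
labels = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(W)
print(labels[:10])
```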

Relevance:

20.00%

Abstract:

Residue depth accurately measures burial and parameterizes the local protein environment. Depth is the distance of any atom/residue to the closest bulk water. We consider the non-bulk waters to occupy cavities, whose volumes are determined using a Voronoi procedure. Our estimation of cavity sizes is statistically superior to the estimates made by CASTp and VOIDOO, and on par with McVol, over a data set of 40 cavities. Our calculated cavity volumes correlate best with the experimentally determined destabilization of 34 mutants from five proteins. Some of the cavities identified are capable of binding small-molecule ligands. In this study, we have enhanced our depth-based predictions of binding sites by including evolutionary information. We have demonstrated that, on a database (LigASite) of ~200 proteins, we perform on par with ConCavity and better than MetaPocket 2.0. Our predictions, while less sensitive, are more specific and precise. Finally, we use depth (and other features) to predict the pKa values of GLU, ASP, LYS and HIS residues. Our results give an average error of less than 1 pH unit over 60 predictions. Our simple empirical method is statistically on par with two other methods, superior to three, and inferior to only one. The DEPTH server (http://mspc.bii.a-star.edu.sg/depth/) is an ideal tool for rapid yet accurate structural analyses of protein structures.
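
A minimal sketch of the depth definition quoted above: each atom's depth is its distance to the nearest bulk water. Deciding which solvent molecules count as "bulk" is the hard part of the actual method; here the bulk-water coordinates are simply assumed to be given.

```python
import numpy as np
from scipy.spatial import cKDTree

def atom_depths(atom_xyz: np.ndarray, bulk_water_xyz: np.ndarray) -> np.ndarray:
    """Per-atom depth: Euclidean distance (angstroms) to the nearest
    bulk-water position, via a k-d tree nearest-neighbour query."""
    tree = cKDTree(bulk_water_xyz)
    depths, _ = tree.query(atom_xyz)
    return depths

# usage (random coordinates standing in for a solvated structure):
atoms = np.random.rand(100, 3) * 30.0
waters = np.random.rand(500, 3) * 40.0
print(atom_depths(atoms, waters)[:5])
```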

Relevance:

20.00%

Abstract:

Temperature-sensitive (Ts) mutants of proteins provide experimentalists with a powerful and reversible way of conditionally expressing genes. The technique has been widely used to determine the role of genes and gene products in several cellular processes. Traditionally, Ts mutants are generated by random mutagenesis and then selected through laborious large-scale screening. Our web server, TSpred (http://mspc.bii.a-star.edu.sg/TSpred/), now enables users to rationally design Ts mutants for their proteins of interest. TSpred uses hydrophobicity and hydrophobic moment, deduced from the primary sequence, and residue depth, inferred from 3D structures, to predict/identify buried hydrophobic residues. Mutating these residues leads to the creation of Ts mutants. Our method has been experimentally validated at 36 positions in six different proteins. It is an attractive proposition for Ts mutant engineering, as it proposes a small number of mutations with high precision. The accompanying web server is simple and intuitive to use and can handle proteins and protein complexes of different sizes.
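
A minimal sketch of the two sequence-derived quantities named above, mean hydrophobicity and the Eisenberg hydrophobic moment over a window. The Kyte-Doolittle scale and the 100-degree helical twist are standard choices used here for illustration; the abstract does not specify TSpred's actual scale or parameters.

```python
import math

# Kyte-Doolittle hydropathy scale
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def mean_hydrophobicity(window: str) -> float:
    return sum(KD[a] for a in window) / len(window)

def hydrophobic_moment(window: str, delta_deg: float = 100.0) -> float:
    """Eisenberg moment |sum_n H_n * exp(i*n*delta)|; delta = 100 degrees
    per residue corresponds to an alpha-helix."""
    delta = math.radians(delta_deg)
    s = sum(KD[a] * math.sin(n * delta) for n, a in enumerate(window))
    c = sum(KD[a] * math.cos(n * delta) for n, a in enumerate(window))
    return math.hypot(s, c)

print(mean_hydrophobicity("ILVFAMW"), hydrophobic_moment("LLKKLLKKLLKKLLKKLL"))
```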

Relevance:

20.00%

Abstract:

Background: The function of a protein can be deciphered with higher accuracy from its structure than from its amino acid sequence. Given the huge gap between the available protein sequence and structural space, tools that can generate functionally homogeneous clusters using only sequence information hold great importance. For this, traditional alignment-based tools work well in most cases, with clustering performed on the basis of sequence similarity. But in the case of multi-domain proteins, the alignment quality may be poor due to the varied lengths of the proteins, domain shuffling or circular permutations. Multi-domain proteins are ubiquitous in nature; hence alignment-free tools, which overcome the shortcomings of alignment-based protein comparison methods, are required. Further, existing tools classify proteins using only domain-level information and hence miss the information encoded in the tethered regions or accessory domains. Our method, in contrast, takes into account the full-length sequence of a protein, consolidating the complete sequence information to understand a given protein better. Results: Our web server, CLAP (Classification of Proteins), is one such alignment-free software tool for automatic classification of protein sequences. It utilizes a pattern-matching algorithm that assigns local matching scores (LMS) to residues that are part of the matched patterns between the two sequences being compared. CLAP works on full-length sequences and does not require prior domain definitions. Pilot studies undertaken previously on protein kinases and immunoglobulins have shown that CLAP yields clusters with high functional and domain-architectural similarity. Moreover, parsing at a statistically determined cut-off resulted in clusters that agreed with the sub-family level classification of the particular domain family. Conclusions: CLAP is a useful protein-clustering tool, independent of domain assignment, domain order, sequence length and domain diversity. Our method can be used for any set of protein sequences, yielding functionally relevant clusters with high domain-architectural homogeneity. The CLAP web server is freely available for academic use at http://nslab.mbu.iisc.ernet.in/clap/.
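
A minimal sketch of the general idea of alignment-free, full-length sequence comparison, here via shared k-mer counting. CLAP's actual pattern-matching and local matching score (LMS) algorithm is more sophisticated; this only illustrates scoring two sequences without aligning them.

```python
from collections import Counter

def kmer_profile(seq: str, k: int = 3) -> Counter:
    """Multiset of overlapping k-mers from the full-length sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Fraction of k-mer mass shared between two profiles (0..1);
    insensitive to domain order and sequence length differences."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    shared = sum((pa & pb).values())          # multiset intersection
    return 2.0 * shared / (sum(pa.values()) + sum(pb.values()))

print(kmer_similarity("MKTAYIAKQRQISFVKSHFSRQ", "SHFSRQMKTAYIAKQRQ"))
```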

Relevance:

20.00%

Abstract:

Background: Haemophilus influenzae (H. influenzae) is the causative agent of pneumonia, bacteraemia and meningitis. The organism is responsible for a large number of deaths in both developed and developing countries. Even though the first bacterial genome to be sequenced was that of H. influenzae, there is no exclusive database dedicated to H. influenzae. This prompted us to develop the Haemophilus influenzae Genome Database (HIGDB). Methods: All data of the HIGDB are stored and managed in a MySQL database. The HIGDB is hosted on a Solaris server and developed using Perl modules; Ajax and JavaScript are used for the interface. Results: The HIGDB contains detailed information on 42,741 proteins and 18,077 genes, including 10 whole-genome sequences, as well as 284 three-dimensional structures of H. influenzae proteins. In addition, the database provides "Motif search" and "GBrowse". The HIGDB is freely accessible through the URL http://bioserver1.physics.iisc.ernet.in/HIGDB/. Discussion: The HIGDB will be a single point of access for bacteriological, clinical, genomic and proteomic information on H. influenzae. The database can also be used to identify DNA motifs within H. influenzae genomes and to compare gene or protein sequences of a particular strain with those of other strains. (C) 2014 Elsevier Ltd. All rights reserved.
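
A minimal sketch of the kind of DNA motif search a genome database can expose, here as a plain overlapping-match scan over a genome string; the abstract does not describe HIGDB's own implementation.

```python
import re

def find_motif(genome: str, motif: str) -> list:
    """0-based start positions of an exact DNA motif (no IUPAC ambiguity
    codes); the lookahead makes overlapping matches count too."""
    return [m.start() for m in re.finditer(f"(?={re.escape(motif)})", genome)]

print(find_motif("ATGTTGACATTGACA", "TTGACA"))   # -> [3, 9]
```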

Relevance:

20.00%

Abstract:

An online computing server, Online_DPI (where DPI denotes the diffraction precision index), has been created to calculate the 'Cruickshank DPI' value for a given three-dimensional protein or macromolecular structure. It also estimates the atomic coordinate error for all the atoms in the structure. It is an easy-to-use web server that enables users to visualize the computed values dynamically on the client machine. Users can provide the Protein Data Bank (PDB) identification code or upload three-dimensional atomic coordinates from the client machine. The computed DPI value for the structure and the atomic coordinate errors for all atoms are included in the revised PDB file. Further, users can graphically view the atomic coordinate error alongside the 'temperature factors' (i.e. atomic displacement parameters). In addition, the computing engine is interfaced with an up-to-date local copy of the Protein Data Bank. New entries are updated every week, so users can access all the structures available in the Protein Data Bank. The computing engine is freely accessible online at http://cluster.physics.iisc.ernet.in/dpi/.
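
A minimal sketch of one commonly quoted form of the Cruickshank DPI (the R_free-based variant); the server may implement a different or more refined formulation, and the input values below are made up for illustration.

```python
def cruickshank_dpi(n_atoms: int, n_obs: int, completeness: float,
                    d_min: float, r_free: float) -> float:
    """DPI = sqrt(N_atoms / N_obs) * C**(-1/3) * d_min * R_free,
    with C the data completeness (0..1) and d_min the resolution in
    angstroms; returns an estimated coordinate error in angstroms."""
    return (n_atoms / n_obs) ** 0.5 * completeness ** (-1.0 / 3.0) * d_min * r_free

# usage with illustrative values for a mid-resolution structure:
print(cruickshank_dpi(n_atoms=2500, n_obs=28000, completeness=0.98,
                      d_min=1.9, r_free=0.22))   # ~0.13 angstroms
```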

Relevance:

20.00%

Abstract:

The broader goal of the research described here is to automatically acquire diagnostic knowledge from documents in the domain of manual and mechanical assembly of aircraft structures. These documents are treated as a discourse used by experts to communicate with others; it therefore becomes possible to use discourse analysis to enable machine understanding of the text. The research challenge addressed in this paper is to identify documents, or sections of documents, that are potential sources of knowledge. In a subsequent step, domain knowledge will be extracted from these segments. The segmentation task requires partitioning the document into relevant segments and understanding the context of each segment. In discourse analysis, the division of a discourse into segments is achieved through certain indicative clauses, called cue phrases, that signal changes in the discourse context. However, in formal documents such language may not be used. Hence the use of a domain-specific ontology and an assembly process model is proposed to segregate chunks of text based on a local context. Elements of the ontology/model and their related terms serve as indicators of the current context of a segment and of changes in context between segments. Local contexts are aggregated over increasingly larger segments to identify whether the document (or portions of it) pertains to the topic of interest, namely assembly. Knowledge acquired through such processes enables the acquisition and reuse of knowledge during any part of the lifecycle of a product.
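
A minimal sketch of the local-context idea described above: score each segment by the density of domain-ontology terms and keep segments whose context matches the topic of interest. The term list and threshold are hypothetical placeholders, not the project's actual ontology.

```python
# hypothetical assembly-ontology terms (placeholders)
ASSEMBLY_TERMS = {"rivet", "fastener", "shim", "drill", "torque", "jig",
                  "bracket", "sealant", "clamp"}

def context_score(segment: str) -> float:
    """Fraction of tokens that are assembly-ontology terms."""
    tokens = [t.strip(".,;:()").lower() for t in segment.split()]
    if not tokens:
        return 0.0
    return sum(t in ASSEMBLY_TERMS for t in tokens) / len(tokens)

def assembly_segments(segments, threshold=0.05):
    """Keep segments whose local context indicates the assembly topic."""
    return [s for s in segments if context_score(s) >= threshold]

doc = ["Install the rivet using the jig and check torque on each fastener.",
       "The company was founded in 1952 and employs 300 people."]
print(assembly_segments(doc))   # only the first segment survives
```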

Relevance:

20.00%

Abstract:

In optical character recognition of very old books, the recognition accuracy drops mainly due to the merging or breaking of characters. In this paper, we propose the first algorithm to segment merged Kannada characters, using a hypothesis to select the positions at which to cut. The method searches for the best possible segmentation positions by taking into account the support vector machine classifier's recognition score and the validity of the aspect ratio (width-to-height ratio) of the segments between every pair of cut positions. The hypothesis for selecting cut positions is based on the fact that a concave surface exists above and below the touching portion. These concave surfaces are located by tracing the valleys in the top contour of the image, and likewise for the image rotated upside-down. The cut positions are then derived as closely matching valleys of the original and rotated images. The proposed segmentation algorithm handles different font styles, shapes and sizes, and performs better than the existing vertical projection profile based segmentation. It has been tested on 1125 word images, each containing multiple merged characters, from an old Kannada book; 89.6% correct segmentation is achieved, and the character recognition accuracy on merged words is 91.2%. A few merge points are still missed because no matching valley exists, owing to the specific shapes of the characters meeting at those merges.
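
A minimal sketch of the valley-matching cue described above, assuming a binarized word image as a NumPy array: valleys of the visual top contour are maxima of the first-ink-row profile (a larger row index means a deeper dip), and candidate cuts are columns where valleys from the original and the flipped image roughly coincide. The paper's SVM rescoring and aspect-ratio checks are omitted.

```python
import numpy as np
from scipy.signal import find_peaks

def top_contour(binary: np.ndarray) -> np.ndarray:
    """Row index of the first ink pixel in each column (image height
    for all-background columns)."""
    h = binary.shape[0]
    ink = binary > 0
    return np.where(ink.any(axis=0), ink.argmax(axis=0), h)

def candidate_cuts(binary: np.ndarray, tol: int = 3) -> list:
    """Columns where a top-contour valley matches a bottom valley
    (found by flipping the image) within tol pixels."""
    v_top, _ = find_peaks(top_contour(binary))        # dips of top contour
    v_bot, _ = find_peaks(top_contour(binary[::-1]))  # dips after flipping
    return [int(c) for c in v_top if np.any(np.abs(v_bot - c) <= tol)]
```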

Relevance:

20.00%

Abstract:

Hydrogen bonds in biological macromolecules play significant structural and functional roles; they are key contributors to most of the interactions without which no living system could exist. In view of this, a web-based computing server, the Hydrogen Bonds Computing Server (HBCS), has been developed to compute hydrogen-bond interactions, and their standard deviations, for any given macromolecular structure. The computing server is connected to a locally maintained Protein Data Bank (PDB) archive, so the user can calculate the above parameters for any deposited structure; an option is also provided to upload a structure in PDB format from the client machine. In addition, the server is interfaced with the molecular viewers Jmol and JSmol for visualizing the hydrogen-bond interactions. The server is freely available and accessible via the World Wide Web at http://bioserver1.physics.iisc.ernet.in/hbcs/.
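
A minimal sketch of geometric hydrogen-bond detection: flag donor-acceptor pairs whose separation falls inside a typical distance window. HBCS's actual criteria (angles, atom typing, standard deviations) are richer than this; the cutoffs below are common textbook values, not the server's.

```python
import numpy as np
from scipy.spatial import cKDTree

def hydrogen_bonds(donor_xyz: np.ndarray, acceptor_xyz: np.ndarray,
                   d_max: float = 3.5, d_min: float = 2.5) -> list:
    """(donor_idx, acceptor_idx, distance) for pairs within
    [d_min, d_max] angstroms; the lower bound excludes covalent contacts."""
    tree = cKDTree(acceptor_xyz)
    pairs = []
    for i, d in enumerate(donor_xyz):
        for j in tree.query_ball_point(d, d_max):
            dist = float(np.linalg.norm(d - acceptor_xyz[j]))
            if dist >= d_min:
                pairs.append((i, j, dist))
    return pairs

# usage (random coordinates standing in for donor/acceptor atoms):
donors = np.random.rand(50, 3) * 20.0
acceptors = np.random.rand(60, 3) * 20.0
print(len(hydrogen_bonds(donors, acceptors)))
```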