975 resultados para similarity search


Relevância:

60.00% 60.00%

Publicador:

Resumo:

Background The RCSB Protein Data Bank (PDB) provides public access to experimentally determined 3D-structures of biological macromolecules (proteins, peptides and nucleic acids). While various tools are available to explore the PDB, options to access the global structural diversity of the entire PDB and to perceive relationships between PDB structures remain very limited. Methods A 136-dimensional atom pair 3D-fingerprint for proteins (3DP) counting categorized atom pairs at increasing through-space distances was designed to represent the molecular shape of PDB-entries. Nearest neighbor searches examples were reported exemplifying the ability of 3DP-similarity to identify closely related biomolecules from small peptides to enzyme and large multiprotein complexes such as virus particles. The principle component analysis was used to obtain the visualization of PDB in 3DP-space. Results The 3DP property space groups proteins and protein assemblies according to their 3D-shape similarity, yet shows exquisite ability to distinguish between closely related structures. An interactive website called PDB-Explorer is presented featuring a color-coded interactive map of PDB in 3DP-space. Each pixel of the map contains one or more PDB-entries which are directly visualized as ribbon diagrams when the pixel is selected. The PDB-Explorer website allows performing 3DP-nearest neighbor searches of any PDB-entry or of any structure uploaded as protein-type PDB file. All functionalities on the website are implemented in JavaScript in a platform-independent manner and draw data from a server that is updated daily with the latest PDB additions, ensuring complete and up-to-date coverage. The essentially instantaneous 3DP-similarity search with the PDB-Explorer provides results comparable to those of much slower 3D-alignment algorithms, and automatically clusters proteins from the same superfamilies in tight groups. Conclusion A chemical space classification of PDB based on molecular shape was obtained using a new atom-pair 3D-fingerprint for proteins and implemented in a web-based database exploration tool comprising an interactive color-coded map of the PDB chemical space and a nearest neighbor search tool. The PDB-Explorer website is freely available at www.​cheminfo.​org/​pdbexplorer and represents an unprecedented opportunity to interactively visualize and explore the structural diversity of the PDB.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The emergence of cloud datacenters enhances the capability of online data storage. Since massive data is stored in datacenters, it is necessary to effectively locate and access interest data in such a distributed system. However, traditional search techniques only allow users to search images over exact-match keywords through a centralized index. These techniques cannot satisfy the requirements of content based image retrieval (CBIR). In this paper, we propose a scalable image retrieval framework which can efficiently support content similarity search and semantic search in the distributed environment. Its key idea is to integrate image feature vectors into distributed hash tables (DHTs) by exploiting the property of locality sensitive hashing (LSH). Thus, images with similar content are most likely gathered into the same node without the knowledge of any global information. For searching semantically close images, the relevance feedback is adopted in our system to overcome the gap between low-level features and high-level features. We show that our approach yields high recall rate with good load balance and only requires a few number of hops.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Three-week-old plants of two unrelated lines of maize (Zea mays L.) and their hybrid were submitted to progressive water stress for 10 d. Changes induced in leaf proteins were studied by two-dimensional electrophoresis and quantitatively analyzed using image analysis. Seventy-eight proteins out of a total of 413 showed a significant quantitative variation (increase or decrease), with 38 of them exhibiting a different expression in the two genotypes. Eleven proteins that increased by a factor of 1.3 to 5 in stressed plants and 8 proteins detected only in stressed plants were selected for internal amino acid microsequencing, and by similarity search 16 were found to be closely related to previously reported proteins. In addition to proteins already known to be involved in the response to water stress (e.g. RAB17 [Responsive to ABA]), several enzymes involved in basic metabolic cellular pathways such as glycolysis and the Krebs cycle (e.g. enolase and triose phosphate isomerase) were identified, as well as several others, including caffeate O-methyltransferase, the induction of which could be related to lignification.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Scorpion toxins are important experimental tools for characterization of vast array of ion channels and serve as scaffolds for drug design. General public database entries contain limited annotation whereby rich structure-function information from mutation studies is typically not available. SCORPION2 contains more than 800 records of native and mutant toxin sequences enriched with binding affinity and toxicity information, 624 three-dimensional structures and some 500 references. SCORPION2 has a set of search and prediction tools that allow users to extract and perform specific queries: text searches of scorpion toxin records, sequence similarity search, extraction of sequences, visualization of scorpion toxin structures, analysis of toxic activity, and functional annotation of previously uncharacterized scorpion toxins. The SCORPION2 database is available at http://sdmc.i2r.a-star.edu.sg/scorpion/. (c) 2006 Elsevier Ltd. All rights reserved.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

In this paper, we propose a novel high-dimensional index method, the BM+-tree, to support efficient processing of similarity search queries in high-dimensional spaces. The main idea of the proposed index is to improve data partitioning efficiency in a high-dimensional space by using a rotary binary hyperplane, which further partitions a subspace and can also take advantage of the twin node concept used in the M+-tree. Compared with the key dimension concept in the M+-tree, the binary hyperplane is more effective in data filtering. High space utilization is achieved by dynamically performing data reallocation between twin nodes. In addition, a post processing step is used after index building to ensure effective filtration. Experimental results using two types of real data sets illustrate a significantly improved filtering efficiency.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

With rapid advances in video processing technologies and ever fast increments in network bandwidth, the popularity of video content publishing and sharing has made similarity search an indispensable operation to retrieve videos of user interests. The video similarity is usually measured by the percentage of similar frames shared by two video sequences, and each frame is typically represented as a high-dimensional feature vector. Unfortunately, high complexity of video content has posed the following major challenges for fast retrieval: (a) effective and compact video representations, (b) efficient similarity measurements, and (c) efficient indexing on the compact representations. In this paper, we propose a number of methods to achieve fast similarity search for very large video database. First, each video sequence is summarized into a small number of clusters, each of which contains similar frames and is represented by a novel compact model called Video Triplet (ViTri). ViTri models a cluster as a tightly bounded hypersphere described by its position, radius, and density. The ViTri similarity is measured by the volume of intersection between two hyperspheres multiplying the minimal density, i.e., the estimated number of similar frames shared by two clusters. The total number of similar frames is then estimated to derive the overall similarity between two video sequences. Hence the time complexity of video similarity measure can be reduced greatly. To further reduce the number of similarity computations on ViTris, we introduce a new one dimensional transformation technique which rotates and shifts the original axis system using PCA in such a way that the original inter-distance between two high-dimensional vectors can be maximally retained after mapping. An efficient B+-tree is then built on the transformed one dimensional values of ViTris' positions. Such a transformation enables B+-tree to achieve its optimal performance by quickly filtering a large portion of non-similar ViTris. Our extensive experiments on real large video datasets prove the effectiveness of our proposals that outperform existing methods significantly.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The rapid growth of the Internet and the advancements of the Web technologies have made it possible for users to have access to large amounts of on-line music data, including music acoustic signals, lyrics, style/mood labels, and user-assigned tags. The progress has made music listening more fun, but has raised an issue of how to organize this data, and more generally, how computer programs can assist users in their music experience. An important subject in computer-aided music listening is music retrieval, i.e., the issue of efficiently helping users in locating the music they are looking for. Traditionally, songs were organized in a hierarchical structure such as genre->artist->album->track, to facilitate the users’ navigation. However, the intentions of the users are often hard to be captured in such a simply organized structure. The users may want to listen to music of a particular mood, style or topic; and/or any songs similar to some given music samples. This motivated us to work on user-centric music retrieval system to improve users’ satisfaction with the system. The traditional music information retrieval research was mainly concerned with classification, clustering, identification, and similarity search of acoustic data of music by way of feature extraction algorithms and machine learning techniques. More recently the music information retrieval research has focused on utilizing other types of data, such as lyrics, user-access patterns, and user-defined tags, and on targeting non-genre categories for classification, such as mood labels and styles. This dissertation focused on investigating and developing effective data mining techniques for (1) organizing and annotating music data with styles, moods and user-assigned tags; (2) performing effective analysis of music data with features from diverse information sources; and (3) recommending music songs to the users utilizing both content features and user access patterns.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Relationship between organisms within an ecosystem is one of the main focuses in the study of ecology and evolution. For instance, host-parasite interactions have long been under close interest of ecology, evolutionary biology and conservation science, due to great variety of strategies and interaction outcomes. The monogenean ecto-parasites consist of a significant portion of flatworms. Gyrodactylus salaris is a monogenean freshwater ecto-parasite of Atlantic salmon (Salmo salar) whose damage can make fish to be prone to further bacterial and fungal infections. G. salaris is the only one parasite whose genome has been studied so far. The RNA-seq data analyzed in this thesis has already been annotated by using LAST. The RNA-seq data was obtained from Illumina sequencing i.e. yielded reads were assembled into 15777 transcripts. Last resulted in annotation of 46% transcripts and remaining were left unknown. This thesis work was started with whole data and annotation process was continued by the use of PANNZER, CDD and InterProScan. This annotation resulted in 56% successfully annotated sequences having parasite specific proteins identified. This thesis represents the first of Monogenean transcriptomic information which gives an important source for further research on this specie. Additionally, comparison of annotation methods interestingly revealed that description and domain based methods perform better than simple similarity search methods. Therefore it is more likely to suggest the use of these tools and databases for functional annotation. These results also emphasize the need for use of multiple methods and databases. It also highlights the need of more genomic information related to G. salaris.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

Background: Aspergillosis has been identified as one of the hospital acquired infections but the contribution of water and inhouse air as possible sources of Aspergillus infection in immunocompromised individuals like HIV-TB patients have not been studied in any hospital setting in Nigeria. Objective: To identify and investigate genetic relationship between clinical and environmental Aspergillus species associated with HIV-TB co infected patients. Methods: DNA extraction, purification, amplification and sequencing of Internal Transcribed Spacer (ITS) genes were performed using standard protocols. Similarity search using BLAST on NCBI was used for species identification and MEGA 5.0 was used for phylogenetic analysis. Results: Analyses of sequenced ITS genes of selected fourteen (14) Aspergillus isolates identified in the GenBank database revealed Aspergillus niger (28.57%), Aspergillus tubingensis (7.14%), Aspergillus flavus (7.14%) and Aspergillus fumigatus (57.14%). Aspergillus in sputum of HIV patients were Aspergillus niger, A. fumigatus, A. tubingensis and A. flavus. Also, A. niger and A. fumigatus were identified from water and open-air. Phylogenetic analysis of sequences yielded genetic relatedness between clinical and environmental isolates. Conclusion: Water and air in health care settings in Nigeria are important sources of Aspergillus sp. for HIV-TB patients.

Relevância:

60.00% 60.00%

Publicador:

Resumo:

The size of online image datasets is constantly increasing. Considering an image dataset with millions of images, image retrieval becomes a seemingly intractable problem for exhaustive similarity search algorithms. Hashing methods, which encodes high-dimensional descriptors into compact binary strings, have become very popular because of their high efficiency in search and storage capacity. In the first part, we propose a multimodal retrieval method based on latent feature models. The procedure consists of a nonparametric Bayesian framework for learning underlying semantically meaningful abstract features in a multimodal dataset, a probabilistic retrieval model that allows cross-modal queries and an extension model for relevance feedback. In the second part, we focus on supervised hashing with kernels. We describe a flexible hashing procedure that treats binary codes and pairwise semantic similarity as latent and observed variables, respectively, in a probabilistic model based on Gaussian processes for binary classification. We present a scalable inference algorithm with the sparse pseudo-input Gaussian process (SPGP) model and distributed computing. In the last part, we define an incremental hashing strategy for dynamic databases where new images are added to the databases frequently. The method is based on a two-stage classification framework using binary and multi-class SVMs. The proposed method also enforces balance in binary codes by an imbalance penalty to obtain higher quality binary codes. We learn hash functions by an efficient algorithm where the NP-hard problem of finding optimal binary codes is solved via cyclic coordinate descent and SVMs are trained in a parallelized incremental manner. For modifications like adding images from an unseen class, we propose an incremental procedure for effective and efficient updates to the previous hash functions. Experiments on three large-scale image datasets demonstrate that the incremental strategy is capable of efficiently updating hash functions to the same retrieval performance as hashing from scratch.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

How do we perform rapid visual categorization?It is widely thought that categorization involves evaluating the similarity of an object to other category items, but the underlying features and similarity relations remain unknown. Here, we hypothesized that categorization performance is based on perceived similarity relations between items within and outside the category. To this end, we measured the categorization performance of human subjects on three diverse visual categories (animals, vehicles, and tools) and across three hierarchical levels (superordinate, basic, and subordinate levels among animals). For the same subjects, we measured their perceived pair-wise similarities between objects using a visual search task. Regardless of category and hierarchical level, we found that the time taken to categorize an object could be predicted using its similarity to members within and outside its category. We were able to account for several classic categorization phenomena, such as (a) the longer times required to reject category membership; (b) the longer times to categorize atypical objects; and (c) differences in performance across tasks and across hierarchical levels. These categorization times were also accounted for by a model that extracts coarse structure from an image. The striking agreement observed between categorization and visual search suggests that these two disparate tasks depend on a shared coarse object representation.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Since multimedia data, such as images and videos, are way more expressive and informative than ordinary text-based data, people find it more attractive to communicate and express with them. Additionally, with the rising popularity of social networking tools such as Facebook and Twitter, multimedia information retrieval can no longer be considered a solitary task. Rather, people constantly collaborate with one another while searching and retrieving information. But the very cause of the popularity of multimedia data, the huge and different types of information a single data object can carry, makes their management a challenging task. Multimedia data is commonly represented as multidimensional feature vectors and carry high-level semantic information. These two characteristics make them very different from traditional alpha-numeric data. Thus, to try to manage them with frameworks and rationales designed for primitive alpha-numeric data, will be inefficient. An index structure is the backbone of any database management system. It has been seen that index structures present in existing relational database management frameworks cannot handle multimedia data effectively. Thus, in this dissertation, a generalized multidimensional index structure is proposed which accommodates the atypical multidimensional representation and the semantic information carried by different multimedia data seamlessly from within one single framework. Additionally, the dissertation investigates the evolving relationships among multimedia data in a collaborative environment and how such information can help to customize the design of the proposed index structure, when it is used to manage multimedia data in a shared environment. Extensive experiments were conducted to present the usability and better performance of the proposed framework over current state-of-art approaches.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

In vector space based approaches to natural language processing, similarity is commonly measured by taking the angle between two vectors representing words or documents in a semantic space. This is natural from a mathematical point of view, as the angle between unit vectors is, up to constant scaling, the only unitarily invariant metric on the unit sphere. However, similarity judgement tasks reveal that human subjects fail to produce data which satisfies the symmetry and triangle inequality requirements for a metric space. A possible conclusion, reached in particular by Tversky et al., is that some of the most basic assumptions of geometric models are unwarranted in the case of psychological similarity, a result which would impose strong limits on the validity and applicability vector space based (and hence also quantum inspired) approaches to the modelling of cognitive processes. This paper proposes a resolution to this fundamental criticism of of the applicability of vector space models of cognition. We argue that pairs of words imply a context which in turn induces a point of view, allowing a subject to estimate semantic similarity. Context is here introduced as a point of view vector (POVV) and the expected similarity is derived as a measure over the POVV's. Different pairs of words will invoke different contexts and different POVV's. Hence the triangle inequality ceases to be a valid constraint on the angles. We test the proposal on a few triples of words and outline further research.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

This paper analyses the pairwise distances of signatures produced by the TopSig retrieval model on two document collections. The distribution of the distances are compared to purely random signatures. It explains why TopSig is only competitive with state of the art retrieval models at early precision. Only the local neighbourhood of the signatures is interpretable. We suggest this is a common property of vector space models.