972 results for Automatic Image Annotation
Abstract:
Background: We present the results of EGASP, a community experiment to assess the state-of-the-art in genome annotation within the ENCODE regions, which span 1% of the human genome sequence. The experiment had two major goals: the assessment of the accuracy of computational methods to predict protein coding genes; and the overall assessment of the completeness of the current human genome annotations as represented in the ENCODE regions. For the computational prediction assessment, eighteen groups contributed gene predictions. We evaluated these submissions against each other based on a ‘reference set’ of annotations generated as part of the GENCODE project. These annotations were not available to the prediction groups prior to the submission deadline, so that their predictions were blind and an external advisory committee could perform a fair assessment. Results: The best methods had at least one gene transcript correctly predicted for close to 70% of the annotated genes. Nevertheless, the multiple transcript accuracy, taking into account alternative splicing, reached only approximately 40% to 50% accuracy. At the coding nucleotide level, the best programs reached an accuracy of 90% in both sensitivity and specificity. Programs relying on mRNA and protein sequences were the most accurate in reproducing the manually curated annotations. Experimental validation shows that only a very small percentage (3.2%) of the selected 221 computationally predicted exons outside of the existing annotation could be verified. Conclusions: This is the first such experiment in human DNA, and we have followed the standards established in a similar experiment, GASP1, in Drosophila melanogaster. We believe the results presented here contribute to the value of ongoing large-scale annotation projects and should guide further experimental methods when being scaled up to the entire human genome sequence.
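For readers unfamiliar with how nucleotide-level accuracy is scored in gene-prediction assessments of this kind, the sketch below shows the usual computation from predicted and annotated coding positions. Note that in this literature "specificity" is conventionally TP/(TP+FP); the data and function name here are illustrative, not EGASP results.

```python
def nucleotide_level_accuracy(predicted_coding, annotated_coding):
    """Sensitivity and specificity over sets of coding-nucleotide positions.

    In gene-prediction evaluation, 'specificity' usually means
    TP / (TP + FP), i.e. what is elsewhere called precision.
    """
    tp = len(predicted_coding & annotated_coding)
    fp = len(predicted_coding - annotated_coding)
    fn = len(annotated_coding - predicted_coding)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity

# Toy example: positions 100-199 annotated as coding, 150-249 predicted.
annotated = set(range(100, 200))
predicted = set(range(150, 250))
print(nucleotide_level_accuracy(predicted, annotated))  # (0.5, 0.5)
```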
Abstract:
This paper proposes a novel approach for the analysis of illicit tablets based on their visual characteristics. In particular, the paper concentrates on the problem of ecstasy pill seizure profiling and monitoring. The presented method extracts the visual information from pill images and builds a representation of it, i.e. it builds a pill profile based on the pill's visual appearance. Different visual features are used to build different image similarity measures, which are the basis for a pill monitoring strategy based on both discriminative and clustering models. The discriminative model makes it possible to infer whether two pills come from the same seizure, while the clustering model groups pills that share similar visual characteristics. The resulting clustering structure allows a visual identification of the relationships between different seizures. The proposed approach was evaluated using a data set of 621 Ecstasy pill pictures. The results demonstrate that this is a feasible and cost-effective method for performing pill profiling and monitoring.
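The abstract does not specify which visual features or models are used; purely as an illustration of the general pipeline (profile extraction, pairwise similarity, grouping), the sketch below uses a colour histogram as the pill profile, histogram intersection as the similarity, and a greedy single-link grouping. All names and the threshold are hypothetical.

```python
import numpy as np

def pill_profile(image, bins=8):
    """Hypothetical pill profile: a normalised joint RGB colour histogram.
    `image` is an HxWx3 uint8 array."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins),
                             range=((0, 256),) * 3)
    return hist.ravel() / hist.sum()

def similarity(p, q):
    """Histogram intersection, in [0, 1]."""
    return float(np.minimum(p, q).sum())

def group_pills(profiles, threshold=0.7):
    """Greedy single-link grouping: a pill joins a group if it is similar
    enough to any member of that group (illustrative only)."""
    groups = []
    for i, p in enumerate(profiles):
        for g in groups:
            if any(similarity(p, profiles[j]) >= threshold for j in g):
                g.append(i)
                break
        else:
            groups.append([i])
    return groups
```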
Abstract:
Background: Single Nucleotide Polymorphisms, among other types of sequence variants, constitute key elements in genetic epidemiology and pharmacogenomics. While sequence data about genetic variation is found in databases such as dbSNP, clues about the functional and phenotypic consequences of the variations are generally found in the biomedical literature. The identification of the relevant documents and the extraction of the information from them are hampered by the large size of literature databases and the lack of a widely accepted standard notation for biomedical entities. Thus, automatic systems for the identification of citations of allelic variants of genes in biomedical texts are required. Results: Our group has previously reported the development of OSIRIS, a system aimed at the retrieval of literature about allelic variants of genes (http://ibi.imim.es/osirisform.html). Here we describe the development of a new version of OSIRIS (OSIRISv1.2, http://ibi.imim.es/OSIRISv1.2.html), which incorporates a new entity recognition module and is built on top of a local mirror of the MEDLINE collection and HgenetInfoDB, a database that collects data on human gene sequence variations. The new entity recognition module is based on a pattern-based search algorithm for the identification of variation terms in the texts and their mapping to dbSNP identifiers. The performance of OSIRISv1.2 was evaluated on a manually annotated corpus, resulting in 99% precision, 82% recall, and an F-score of 0.89. As an example, the application of the system for collecting literature citations for the allelic variants of genes related to the diseases intracranial aneurysm and breast cancer is presented. Conclusion: OSIRISv1.2 can be used to link literature references to dbSNP database entries with high accuracy, and therefore is suitable for collecting current knowledge on gene sequence variations and supporting the functional annotation of variation databases. The application of OSIRISv1.2 in combination with controlled vocabularies like MeSH provides a way to identify associations of biomedical interest, such as those that relate SNPs with diseases.
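To give an idea of what pattern-based recognition of variant mentions looks like in practice, the snippet below matches a few common variant notations (e.g. "C677T", "1691G>A", "Arg72Pro", "rs" identifiers). These patterns are invented examples, not the actual OSIRIS rule set, and the normalisation to dbSNP identifiers is omitted.

```python
import re

# Illustrative patterns only; the real OSIRIS pattern set is richer and
# includes a mapping step from mentions to dbSNP identifiers.
VARIANT_PATTERNS = [
    re.compile(r"\b[ACGT]\d+[ACGT]\b"),     # e.g. C677T
    re.compile(r"\b\d+[ACGT]>[ACGT]\b"),    # e.g. 1691G>A
    re.compile(r"\b(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|"
               r"Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)\d+"
               r"(?:Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|Leu|Lys|"
               r"Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val)\b"),  # e.g. Arg72Pro
    re.compile(r"\brs\d+\b"),               # dbSNP accession numbers
]

def find_variant_mentions(text):
    """Return candidate variant mentions found in a sentence."""
    mentions = []
    for pattern in VARIANT_PATTERNS:
        mentions.extend(m.group(0) for m in pattern.finditer(text))
    return mentions

print(find_variant_mentions(
    "The C677T polymorphism (rs1801133) of MTHFR and the "
    "1691G>A variant were genotyped."))
```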
Abstract:
Purpose: To evaluate the suitability of an improved version of an automatic segmentation method based on geodesic active regions (GAR) for segmenting cerebral vasculature with aneurysms from 3D X-ray reconstruction angiography (3DRA) and time-of-flight magnetic resonance angiography (TOF-MRA) images available in the clinical routine. Methods: Three aspects of the GAR method have been improved: execution time, robustness to variability in imaging protocols, and robustness to variability in image spatial resolutions. The improved GAR was retrospectively evaluated on images from patients containing intracranial aneurysms in the area of the Circle of Willis and imaged with two modalities: 3DRA and TOF-MRA. Images were obtained from two clinical centers, each using different imaging equipment. Evaluation included qualitative and quantitative analyses of the segmentation results on 20 images from 10 patients. The gold standard was built from 660 cross-sections (33 per image) of vessels and aneurysms, manually measured by interventional neuroradiologists. GAR has also been compared to an interactive segmentation method: iso-intensity surface extraction (ISE). In addition, since patients had been imaged with the two modalities, we performed an inter-modality agreement analysis with respect to both the manual measurements and each of the two segmentation methods. Results: Both GAR and ISE differed from the gold standard within acceptable limits compared to the imaging resolution. GAR (ISE, respectively) had an average accuracy of 0.20 (0.24) mm for 3DRA and 0.27 (0.30) mm for TOF-MRA, and had a repeatability of 0.05 (0.20) mm. Compared to ISE, GAR had a lower qualitative error in the vessel region and a lower quantitative error in the aneurysm region. The repeatability of GAR was superior to manual measurements and ISE. The inter-modality agreement was similar between GAR and the manual measurements. Conclusions: The improved GAR method outperformed ISE qualitatively as well as quantitatively and is suitable for segmenting 3DRA and TOF-MRA images from the clinical routine.
Abstract:
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available, with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two-exon models, indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
Abstract:
In this work we propose a new automatic methodology for computing accurate digital elevation models (DEMs) in urban environments from low-baseline stereo pairs that shall be available in the future from a new kind of earth observation satellite. This setting makes both views of the scene very similar, thus avoiding occlusions and illumination changes, which are the main disadvantages of the commonly accepted large-baseline configuration. There still remain two crucial technological challenges: (i) precisely estimating DEMs with strong discontinuities and (ii) providing a statistically proven result, automatically. The first one is solved here by a piecewise affine representation that is well adapted to man-made landscapes, whereas the application of computational Gestalt theory introduces reliability and automation. In fact this theory allows us to reduce the number of parameters to be adjusted, and to control the number of false detections. This leads to the selection of a suitable segmentation into affine regions (whenever possible) by a novel and completely automatic perceptual grouping method. It also allows us to discriminate, e.g., vegetation-dominated regions, where such an affine model does not apply and a more classical correlation technique should be preferred. In addition we propose here an extension of the classical "quantized" Gestalt theory to continuous measurements, thus combining its reliability with the precision of variational robust estimation and fine interpolation methods that are necessary in the low-baseline case. Such an extension is very general and will be useful for many other applications as well.
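The central representational choice here is piecewise affine: within each segmented region the elevation is modelled as a plane z = ax + by + c. As a minimal sketch of that idea only (a plain least-squares fit, not the robust variational estimator the paper describes; all names are hypothetical):

```python
import numpy as np

def fit_affine_patch(x, y, z):
    """Least-squares fit of z = a*x + b*y + c over one region.
    x, y, z are 1-D arrays of pixel coordinates and heights."""
    A = np.column_stack([x, y, np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs  # (a, b, c)

def piecewise_affine_dem(labels, height):
    """Replace the heights inside each labelled region by its fitted plane."""
    dem = np.array(height, dtype=float)
    ys, xs = np.indices(height.shape)
    for region in np.unique(labels):
        mask = labels == region
        a, b, c = fit_affine_patch(xs[mask], ys[mask], height[mask])
        dem[mask] = a * xs[mask] + b * ys[mask] + c
    return dem
```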
Abstract:
User generated content shared in online communities is often described using collaborative tagging systems where users assign labels to content resources. As a result, a folksonomy emerges that relates a number of tags with the resources they label and the users that have used them. In this paper we analyze the folksonomy of Freesound, an online audio clip sharing site which contains more than two million users and 150,000 user-contributed sound samplescovering a wide variety of sounds. By following methodologies taken from similar studies, we compute some metrics that characterize the folksonomy both at the globallevel and at the tag level. In this manner, we are able to betterunderstand the behavior of the folksonomy as a whole, and also obtain some indicators that can be used as metadata for describing tags themselves. We expect that such a methodology for characterizing folksonomies can be useful to support processes such as tag recommendation or automatic annotation of online resources.
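The specific metrics are defined in the paper itself; as a generic illustration, tag-level indicators such as tag frequency, the number of distinct users applying a tag, and tag co-occurrence can be computed from (user, resource, tag) triples roughly as follows (all names here are hypothetical):

```python
from collections import Counter, defaultdict
from itertools import combinations

def tag_level_metrics(annotations):
    """annotations: iterable of (user, resource, tag) triples."""
    tag_freq = Counter()
    tag_users = defaultdict(set)
    resource_tags = defaultdict(set)
    for user, resource, tag in annotations:
        tag_freq[tag] += 1
        tag_users[tag].add(user)
        resource_tags[resource].add(tag)

    # Two tags co-occur when they label the same resource.
    cooccurrence = Counter()
    for tags in resource_tags.values():
        for a, b in combinations(sorted(tags), 2):
            cooccurrence[(a, b)] += 1

    return {
        "frequency": dict(tag_freq),
        "distinct_users": {t: len(u) for t, u in tag_users.items()},
        "co_occurrence": dict(cooccurrence),
    }

metrics = tag_level_metrics([
    ("u1", "clip1", "field-recording"), ("u1", "clip1", "birds"),
    ("u2", "clip2", "birds"), ("u2", "clip2", "forest"),
])
```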
Abstract:
Automatic classification of makams from symbolic data is a rarely studied topic. In this paper, first a review of an n-gram based approach is presented using various representations of the symbolic data. While a high degree of precision can be obtained, confusion happens mainly for makams that use (almost) the same scale and pitch hierarchy but differ in overall melodic progression, seyir. To further improve the system, n-gram based classification is first tested on various sections of the piece in order to take into account a feature of the seyir, namely that the melodic progression starts in a certain region of the scale. In a second test, a hierarchical classification structure is designed which uses n-grams and seyir features at different levels to further improve the system.
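An n-gram classifier of the general kind reviewed here can be sketched as follows: each makam is represented by the n-gram counts of its training pieces (e.g. over pitch or interval symbols), and a test piece is assigned to the makam whose smoothed n-gram distribution gives it the highest score. This is a generic sketch under those assumptions, not the exact representation or model used in the paper.

```python
import math
from collections import Counter

def ngrams(symbols, n=3):
    """Overlapping n-grams over a symbolic sequence (e.g. pitch classes)."""
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]

def train(corpus, n=3):
    """corpus: dict mapping makam name -> list of symbol sequences."""
    return {makam: Counter(g for piece in pieces for g in ngrams(piece, n))
            for makam, pieces in corpus.items()}

def classify(piece, models, n=3):
    """Assign the makam with the highest add-one smoothed log-likelihood."""
    best, best_score = None, float("-inf")
    for makam, counts in models.items():
        total = sum(counts.values())
        vocab = len(counts) + 1
        score = sum(math.log((counts[g] + 1) / (total + vocab))
                    for g in ngrams(piece, n))
        if score > best_score:
            best, best_score = makam, score
    return best
```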
Abstract:
Language Resources are a critical component for Natural Language Processing applications. Throughout the years many resources were manually created for the same task, but with different granularity and coverage information. To create richer resources for a broad range of potential reuses, the information from all resources has to be joined into one. The high cost of comparing and merging different resources by hand has been a bottleneck for merging existing resources. With the objective of reducing human intervention, we present a new method for automating the merging of resources. We have addressed the merging of two verb subcategorization frame (SCF) lexica for Spanish. The results achieved, a new lexicon with enriched information and conflicting information signalled, reinforce our idea that this approach can be applied to other NLP tasks.
Abstract:
This article reports on the results of research towards the fully automatic merging of lexical resources. Our main goal is to show the generality of the proposed approach, which has previously been applied to merge Spanish subcategorization frame lexica. In this work we extend and apply the same technique to perform the merging of morphosyntactic lexica encoded in LMF. The experiments showed that the technique is general enough to obtain good results in these two different tasks, which is an important step towards performing the merging of lexical resources fully automatically.
Abstract:
The work we present here addresses cue-based noun classification in English and Spanish. Its main objective is to automatically acquire lexical semantic information by classifying nouns into previously known noun lexical classes. This is achieved by using particular aspects of linguistic contexts as cues that identify a specific lexical class. Here we concentrate on the task of identifying such cues and on the theoretical background that allows for an assessment of the complexity of the task. The results show that, despite the a priori complexity of the task, cue-based classification is a useful tool in the automatic acquisition of lexical semantic classes.
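As an illustration of the general idea only (the actual cue inventory is the subject of the paper), cue-based classification can be reduced to counting class-indicative contextual patterns around a noun's occurrences and assigning the class whose cues fire most often. The cue patterns below are invented examples.

```python
import re
from collections import Counter

# Invented example cues; "{N}" marks the position of the target noun.
CLASS_CUES = {
    "event": ["during the {N}", "{N} took place", "{N} lasted"],
    "human": ["the {N} who", "{N} said"],
}

def classify_noun(noun, contexts):
    """Assign the lexical class whose cues fire most often in the
    occurrence contexts (sentences) of the target noun."""
    scores = Counter()
    for label, cues in CLASS_CUES.items():
        for cue in cues:
            pattern = re.compile(
                r"\b" + cue.replace("{N}", re.escape(noun)) + r"\b",
                re.IGNORECASE)
            scores[label] += sum(bool(pattern.search(s)) for s in contexts)
    if not scores:
        return None
    label, hits = scores.most_common(1)[0]
    return label if hits > 0 else None

print(classify_noun("meeting", ["The meeting took place on Monday.",
                                "During the meeting she left."]))  # event
```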
Abstract:
Automatic creation of polarity lexicons is a crucial issue to be solved in order to reduce time and effort in the first steps of Sentiment Analysis. In this paper we present a methodology based on linguistic cues that allows us to automatically discover, extract and label subjective adjectives that should be collected in a domain-based polarity lexicon. For this purpose, we designed a bootstrapping algorithm that, from a small set of seed polar adjectives, is capable of iteratively identifying, extracting and annotating positive and negative adjectives. Additionally, the method automatically creates lists of highly subjective elements that change their prior polarity even within the same domain. The proposed algorithm reached a precision of 97.5% for positive adjectives and 71.4% for negative ones in the semantic orientation identification task.
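A generic version of such a bootstrapping loop (not the authors' exact cues, stopping criterion, or domain handling) propagates polarity through conjunction patterns: adjectives coordinated with "and" tend to share polarity, while "but" tends to reverse it. The patterns and corpus interface below are assumptions for illustration.

```python
import re

AND_PATTERN = re.compile(r"\b(\w+) and (\w+)\b", re.I)
BUT_PATTERN = re.compile(r"\b(\w+) but (\w+)\b", re.I)

def bootstrap_polarity(sentences, seeds, iterations=5):
    """seeds: dict adjective -> +1 / -1. Returns an enlarged lexicon.
    Simplified sketch: a real system would also check that the coordinated
    words are adjectives and apply frequency thresholds."""
    lexicon = dict(seeds)
    for _ in range(iterations):
        added = False
        for sentence in sentences:
            for pattern, flip in ((AND_PATTERN, 1), (BUT_PATTERN, -1)):
                for a, b in pattern.findall(sentence):
                    a, b = a.lower(), b.lower()
                    if a in lexicon and b not in lexicon:
                        lexicon[b] = lexicon[a] * flip
                        added = True
                    elif b in lexicon and a not in lexicon:
                        lexicon[a] = lexicon[b] * flip
                        added = True
        if not added:
            break
    return lexicon

lexicon = bootstrap_polarity(
    ["The room was clean and comfortable but noisy."],
    seeds={"clean": 1},
)  # -> {'clean': 1, 'comfortable': 1, 'noisy': -1}
```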
Abstract:
Lexical Resources are a critical component for Natural Language Processing applications. However, the high cost of comparing and merging different resources has been a bottleneck to having richer resources with a broad range of potential uses for a significant number of languages. With the objective of reducing cost by eliminating human intervention, we present a new method for automating the merging of resources, with special emphasis on what we call the mapping step. This mapping step, which converts the resources into a common format that later allows the merging, is usually performed with huge manual effort and thus makes the whole process very costly. We therefore propose a method to perform this mapping fully automatically. To test our method, we have addressed the merging of two verb subcategorization frame lexica for Spanish. The results achieved, which almost replicate human work, demonstrate the feasibility of the approach.
Abstract:
In this work we present the results of experimental work on the development of lexical class-based lexica by automatic means. Our purpose is to assess the use of linguistic lexical-class-based information as a feature selection methodology for the use of classifiers in quick lexical development. The results show that the approach can significantly reduce the human effort required in the development of language resources.
Abstract:
Lexical Resources are a critical component for Natural Language Processing applications. However, the high cost of comparing and merging different resources has been a bottleneck to obtaining richer resources and a broader range of potential uses for a significant number of languages. With the objective of reducing cost by eliminating human intervention, we present a new method towards the automatic merging of resources. This method includes both the automatic mapping of the resources involved to a common format and their merging once in this format. This paper presents how we have addressed the merging of two verb subcategorization frame lexica for Spanish, but our method will be extended to cover other types of Lexical Resources. The achieved results, which almost replicate human work, demonstrate the feasibility of the approach.
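Once two lexica have been mapped to a common format, the merging step itself can be thought of as a keyed union that keeps the combined information and signals disagreements for later inspection. The sketch below is only a schematic illustration of that step under an invented entry structure, not the system described in these abstracts.

```python
def merge_lexica(lex_a, lex_b):
    """lex_a, lex_b: dicts mapping a lemma to a set of SCF labels
    (already converted to a common format). Returns the merged lexicon
    plus a record of lemmas where the two sources disagree."""
    merged, conflicts = {}, {}
    for lemma in set(lex_a) | set(lex_b):
        frames_a = lex_a.get(lemma, set())
        frames_b = lex_b.get(lemma, set())
        merged[lemma] = frames_a | frames_b
        if frames_a and frames_b and frames_a != frames_b:
            # Signal conflicting/complementary information for review.
            conflicts[lemma] = {"only_a": frames_a - frames_b,
                                "only_b": frames_b - frames_a}
    return merged, conflicts

merged, conflicts = merge_lexica(
    {"abrir": {"NP", "NP_PP"}},
    {"abrir": {"NP"}, "amar": {"NP"}},
)
```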