836 results for Text retrieval
Resumo:
Magdeburg, Univ., Med. Fak., Diss., 2014
Resumo:
BACKGROUND: The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information, we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS: The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated on the basis of large-scale proteomics experiments. CONCLUSIONS: The information retrieval and extraction method developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, provided that a thorough understanding of the working process and requirements is first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
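A toy illustration of the pattern-matching step described above; the curators' actual patterns and rules are not given in the abstract, so the regexes and example sentence below are hypothetical:

```python
import re

# Hypothetical patterns for a few frequent PTM types and modified residues;
# a production system's rules would be far more extensive.
PTM_PATTERN = re.compile(
    r"(phosphorylat\w+|acetylat\w+|methylat\w+|ubiquitinat\w+)", re.IGNORECASE
)
SITE_PATTERN = re.compile(r"\b(Ser|Thr|Tyr|Lys|Arg)-?(\d+)\b")

def extract_ptm_sentences(sentences):
    """Keep sentences that mention a PTM; report the match and any sites."""
    hits = []
    for s in sentences:
        m = PTM_PATTERN.search(s)
        if m:
            sites = [(aa, int(pos)) for aa, pos in SITE_PATTERN.findall(s)]
            hits.append((s, m.group(1).lower(), sites))
    return hits

text = [
    "p53 is phosphorylated at Ser-15 in response to DNA damage.",
    "The protein localizes to the nucleus.",
]
hits = extract_ptm_sentences(text)
```

A real curation pipeline would add rules to rank candidate proteins for each modification, which this sketch omits.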
Resumo:
The Internet is the infrastructure of electronic mail and has long been an important source of information for academic users. It has also become a significant information source for commercial companies as they seek to stay in touch with their customers and to monitor their competitors. The growth of the Web, both in volume and in diversity, has created a growing demand for advanced information management services. Such services include clustering and classification, information discovery and filtering, and the personalization and tracking of source usage. Although the amount of scientifically and commercially valuable information available on the Web has grown considerably in recent years, finding and retrieving it still relies on conventional Internet search engines. Satisfying the growing and changing needs of information retrieval has become a complex task for Internet search engines. Classification and indexing are a significant part of finding and retrieving reliable and precise information. This thesis presents the most common methods used in classification and indexing, together with applications and projects that use them to solve problems related to information retrieval.
Resumo:
Recent advances in machine learning increasingly enable the automatic construction of computer-assisted methods that have been difficult or laborious for human experts to program. The tasks for which such tools are needed arise in many areas, especially in bioinformatics and natural language processing. Machine learning methods may not work satisfactorily if they are not appropriately tailored to the task in question, but their learning performance can often be improved by taking advantage of deeper insight into the application domain or the learning problem at hand. This thesis considers the development of kernel-based learning algorithms that incorporate this kind of prior knowledge in an advantageous way. Moreover, computationally efficient algorithms for training the learning machines for specific tasks are presented. In kernel-based learning, prior knowledge is often incorporated by designing appropriate kernel functions; another well-known approach is to develop cost functions that fit the task under consideration. For disambiguation tasks in natural language, we develop kernel functions that take into account the positional information and the mutual similarities of words, and we show that the use of this information significantly improves the disambiguation performance of the learning machine. Further, we design a new cost function that is better suited to information retrieval and to more general ranking problems than the cost functions designed for regression and classification. We also consider other applications of kernel-based learning algorithms, such as text categorization and pattern recognition in differential display. We develop computationally efficient algorithms for training the considered learning machines with the proposed kernel functions. We also design a fast cross-validation algorithm for regularized least-squares type learning algorithms. Further, an efficient version of the regularized least-squares algorithm that can be used together with the new cost function for preference learning and ranking tasks is proposed. In summary, we demonstrate that the incorporation of prior knowledge is possible and beneficial, and that novel kernels and cost functions can be used efficiently in these algorithms.
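The fast cross-validation idea for regularized least-squares can be illustrated with the standard closed-form leave-one-out identity for kernel ridge regression; the thesis's exact algorithm is not reproduced here, and the names below are illustrative:

```python
import numpy as np

def rls_loo_errors(K, y, lam):
    """Exact leave-one-out residuals for kernel regularized least-squares,
    computed from a single fit instead of n retrainings:
    e_i = (y_i - f(x_i)) / (1 - H_ii), where H = K (K + lam*I)^{-1}
    is the smoother ("hat") matrix."""
    n = K.shape[0]
    H = K @ np.linalg.inv(K + lam * np.eye(n))
    residuals = y - H @ y
    return residuals / (1.0 - np.diag(H))
```

Each e_i equals the residual of a model trained with point i held out, so model selection over the regularization parameter costs one matrix factorization rather than n.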
Resumo:
A neural network procedure to solve inverse chemical kinetic problems is discussed in this work. Rate constants are calculated from the product concentrations of an irreversible consecutive reaction: the hydrogenation of the citral molecule, a process of industrial interest. Both simulated and experimental data are considered. Errors of up to 7% in the simulated concentrations were assumed in order to investigate the robustness of the inverse procedure. The proposed method is also compared with two common methods in nonlinear analysis: the Simplex and Levenberg-Marquardt approaches. In all situations investigated, the neural network approach was numerically stable and robust with respect to deviations in the initial conditions or experimental noise.
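The inverse problem can be made concrete with the analytic solution of a first-order consecutive reaction A → B → C; here a naive grid search stands in for the paper's neural network, and the rate values and grid are illustrative:

```python
import numpy as np

def consecutive_conc(t, k1, k2, a0=1.0):
    """Analytic concentrations for the irreversible consecutive reaction
    A -> B -> C with first-order rate constants k1 != k2."""
    a = a0 * np.exp(-k1 * t)
    b = a0 * k1 / (k2 - k1) * (np.exp(-k1 * t) - np.exp(-k2 * t))
    return a, b, a0 - a - b

def invert_grid(t, b_obs, grid):
    """Inverse problem: pick (k1, k2) minimizing squared error on [B]."""
    best, best_err = None, np.inf
    for k1 in grid:
        for k2 in grid:
            if abs(k2 - k1) < 1e-9:
                continue  # analytic form above assumes k1 != k2
            _, b, _ = consecutive_conc(t, k1, k2)
            err = float(np.sum((b - b_obs) ** 2))
            if err < best_err:
                best, best_err = (k1, k2), err
    return best

t = np.linspace(0.0, 5.0, 50)
_, b_true, _ = consecutive_conc(t, 0.8, 0.3)
k1_est, k2_est = invert_grid(t, b_true, np.arange(0.1, 1.01, 0.1))
```

A neural network (or Simplex/Levenberg-Marquardt) replaces this exhaustive search with a learned or iterative mapping from concentration profiles to rate constants, which is what makes robustness to noisy inputs the key question.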
Resumo:
Undernutrition of dams and pups disrupts the retrieval efficiency of mothers. However, if the mothers are assessed in their home cages, they spend more time with their litters. In the present study the effect of test conditions on pup retrieval behavior of mothers receiving a 25% (well-nourished group) and 8% casein diet (undernourished group) was examined. In agreement with previous studies, undernourished mothers spent more time with their litters than well-nourished dams as lactation proceeded. Pup retrieval behavior varied with test conditions. In the first experiment, the maternal behavior of dams was assessed by the standard procedure (pups were separated from their mother and scattered over the floor of the home cage). The mother was then returned and the number of retrieved pups was recorded. From day 3 to 8, the retrieval efficiency of undernourished dams decreased, while the retrieval efficiency of well-nourished mothers did not vary. In the second experiment, mothers were subjected to a single retrieval test (on day 9 of lactation) using the procedure described for experiment 1. No difference between well-nourished and undernourished mothers was observed. In the third experiment, seven-day-old pups were separated from the mothers and returned individually to a clean home cage. Dietary treatment did not affect the retrieval efficiency. However, undernourished dams reconstructed the nest more slowly than did well-nourished dams. Taken together, these results suggest that pup retrieval behavior of the undernourished mother is not impaired by dietary restriction when the maternal environment is disturbed minimally.
Resumo:
The effects of L-histidine (LH) on anxiety and memory retrieval were investigated in adult male Swiss Albino mice (weight 30-35 g) using the elevated plus-maze. The test was performed on two consecutive days: trial 1 (T1) and trial 2 (T2). In T1, mice received an intraperitoneal injection of saline (SAL) or LH before the test and were then injected again and retested 24 h later. LH had no effect on anxiety at the dose of 200 mg/kg since there was no difference between the SAL-SAL and LH-LH groups at T1 regarding open-arm entries (OAE) and open-arm time (OAT) (mean ± SEM; OAE: 4.0 ± 0.71, 4.80 ± 1.05; OAT: 40.55 ± 9.90, 51.55 ± 12.10, respectively; P > 0.05, Kruskal-Wallis test), or at the dose of 500 mg/kg (OAE: 5.27 ± 0.73, 4.87 ± 0.66; OAT: 63.93 ± 11.72, 63.58 ± 10.22; P > 0.05, Fisher LSD test). At T2, LH-LH animals did not reduce open-arm activity (OAE and OAT) at the dose of 200 mg/kg (T1: 4.87 ± 0.66, T2: 5.47 ± 1.05; T1: 63.58 ± 10.22; T2: 49.01 ± 8.43 for OAE and OAT, respectively; P > 0.05, Wilcoxon test) or at the dose of 500 mg/kg (T1: 4.80 ± 1.60, T2: 4.70 ± 1.04; T1: 51.55 ± 12.10, T2: 43.88 ± 10.64 for OAE and OAT, respectively; P > 0.05, Fisher LSD test), showing an inability to evoke memory 24 h later. These data suggest that LH does not act on anxiety but does induce a state-dependent memory retrieval deficit in mice.
Resumo:
Search engines are part of our daily lives. Currently, more than a third of the world's population uses the Internet, and search engines allow them to quickly find the information or products they want. Information retrieval (IR) is the foundation of modern search engines. Traditional IR approaches assume that index terms are independent; however, terms that appear in the same context are often dependent. Failure to take these dependencies into account is one cause of noise in the results (non-relevant results). Some studies have proposed integrating certain types of dependency, such as proximity, co-occurrence, adjacency, and grammatical dependency. In most cases, the dependency models are built separately and then combined with the traditional bag-of-words model with a constant weight. As a consequence, they cannot properly capture varying dependencies and dependency strength. For example, the dependency between the adjacent words "Black Friday" is stronger than that between the words "road constructions". In this thesis, we study different approaches to capturing term relations and their dependency strength, and propose the following methods. We re-examine the combination approach using different indexing units for Chinese monolingual IR and English-Chinese cross-language IR. In addition to words, we study the possibility of using bigrams and unigrams as translation units for Chinese. Several translation models are built to translate English words into Chinese unigrams, bigrams, and words using a parallel corpus. An English query is then translated in several ways, and a ranking score is produced for each translation. The final ranking score combines all these types of translation. We consider dependencies between terms using Dempster-Shafer evidence theory. An occurrence of a text fragment (of several words) in a document is considered to represent the set of all its constituent terms. Probability mass is assigned to such a set of terms rather than to each individual term. At query evaluation time, this mass is redistributed to the query terms if they differ. This approach allows us to integrate dependency relations between terms. We propose a discriminative model to integrate the different types of dependency according to their strength and their usefulness for IR. In particular, we consider adjacency and co-occurrence dependencies at different distances, that is, bigrams and term pairs within windows of 2, 4, 8, and 16 words. The weight of a bigram or of a pair of dependent terms is determined from a set of features using SVM regression. All the proposed methods are evaluated on several English and/or Chinese collections, and the experimental results show that they yield substantial improvements over the state of the art.
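The adjacency and windowed co-occurrence features described above can be sketched as follows (feature extraction only; the SVM-regression weighting of the thesis is not reproduced):

```python
def term_dependencies(tokens, windows=(2, 4, 8, 16)):
    """Collect adjacent bigrams and unordered term pairs co-occurring
    within windows of the given sizes, as candidate dependency features."""
    feats = {"bigram": set()}
    for w in windows:
        feats[f"win{w}"] = set()
    for i, tok in enumerate(tokens):
        if i + 1 < len(tokens):
            feats["bigram"].add((tok, tokens[i + 1]))  # ordered adjacency
        for w in windows:
            # unordered pairs whose positions fall in the same w-word window
            for j in range(i + 1, min(i + w, len(tokens))):
                feats[f"win{w}"].add(frozenset((tok, tokens[j])))
    return feats

feats = term_dependencies("black friday sales start early".split())
```

A discriminative model would then assign each bigram or pair a weight from features such as window size and corpus statistics, rather than the constant weight used by earlier combination approaches.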
Resumo:
In this paper a method of copy detection in short Malayalam text passages is proposed. Given two passages, one as the source text and the other as the suspect text, it is determined whether the second passage is a plagiarized version of the source. An algorithm for plagiarism detection using an n-gram model for word retrieval is developed, and trigrams are found to be the best model for comparing Malayalam text. Based on the probability and resemblance measures calculated from the n-gram comparison, the text is categorized against a threshold. Texts are compared by variable-length n-gram (n = {2, 3, 4}) comparisons. The experiments show that the trigram model gives acceptable average performance at an affordable computational cost.
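The trigram comparison can be sketched with a Jaccard-style resemblance measure; the threshold value below is illustrative, not the paper's:

```python
def ngrams(text, n=3):
    """Set of word n-grams of a passage (whitespace tokenization)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def resemblance(source, suspect, n=3):
    """Jaccard resemblance between the word n-gram sets of two passages."""
    a, b = ngrams(source, n), ngrams(suspect, n)
    return len(a & b) / len(a | b) if (a | b) else 0.0

def is_plagiarized(source, suspect, n=3, threshold=0.25):
    # threshold chosen for illustration; in practice it is tuned on data
    return resemblance(source, suspect, n) >= threshold
```

Smaller n (bigrams) tolerates more rewording but admits more false matches; larger n (4-grams) is stricter, which is the trade-off behind the paper's choice of trigrams.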
Resumo:
This class introduces basics of web mining and information retrieval including, for example, an introduction to the Vector Space Model and Text Mining. Guest Lecturer: Dr. Michael Granitzer Optional: Modeling the Internet and the Web: Probabilistic Methods and Algorithms, Pierre Baldi, Paolo Frasconi, Padhraic Smyth, Wiley, 2003 (Chapter 4, Text Analysis)
Resumo:
Finding journal articles from full-text sources such as IEEEXplore, ACM and LNCS (Lecture Notes in Computer Science)
Resumo:
The τ-ω model of microwave emission from soil and vegetation layers is widely used to estimate soil moisture content from passive microwave observations. Its application to prospective satellite-based observations aggregating several thousand square kilometres requires an understanding of the effects of scene heterogeneity. The effects of heterogeneity in soil surface roughness, soil moisture, water area and vegetation density on the retrieval of soil moisture from simulated single- and multi-angle observing systems were tested. Uncertainty in water area proved the most serious problem for both systems, causing errors of a few percent in soil moisture retrieval. Single-angle retrieval was largely unaffected by the other factors studied here. Multi-angle retrieval errors of around one percent arose from heterogeneity in either soil roughness or soil moisture, and errors of a few percent were caused by vegetation heterogeneity. A simple extension of the model's vegetation representation was shown to reduce this error substantially for scenes containing a range of vegetation types.
Resumo:
The potential of the τ-ω model for retrieving the volumetric moisture content of bare and vegetated soil from dual-polarisation passive microwave data acquired at single and multiple angles is tested. Measurement error and several additional sources of uncertainty will affect the theoretical retrieval accuracy. These include uncertainty in the soil temperature, in the vegetation structure and consequently its microwave single-scattering albedo, and in the soil microwave emissivity arising from its roughness. To test the effects of these uncertainties for simple homogeneous scenes, we attempt to retrieve soil moisture from a number of simulated microwave brightness temperature datasets generated using the τ-ω model. The uncertainties for each influence are estimated and applied to curves generated for typical scenarios, and an inverse model is used to retrieve the soil moisture content, vegetation optical depth and soil temperature. The effect of each influence on the theoretical soil moisture retrieval limit is explored, the likelihood of each sensor configuration meeting user requirements is assessed, and the most effective means of improving moisture retrieval are indicated.
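For reference, the zeroth-order τ-ω forward model that such retrievals invert can be sketched for a single polarization (a minimal form; parameter names are illustrative and higher-order scattering is neglected):

```python
import math

def tau_omega_tb(soil_temp, veg_temp, soil_emis, tau, omega, theta_deg):
    """Zeroth-order tau-omega brightness temperature for one polarization.
    gamma = exp(-tau/cos(theta)) is the canopy transmissivity; the terms are
    (1) attenuated soil emission, (2) direct canopy emission, and
    (3) canopy emission reflected by the soil and re-attenuated."""
    gamma = math.exp(-tau / math.cos(math.radians(theta_deg)))
    r_soil = 1.0 - soil_emis                          # soil reflectivity
    canopy = (1.0 - omega) * veg_temp * (1.0 - gamma)
    return soil_emis * soil_temp * gamma + canopy + canopy * r_soil * gamma
```

Two limits show why the uncertainties above matter: with tau = 0 the brightness temperature reduces to the soil emission alone, and as tau grows the canopy dominates and the sensitivity to soil moisture (via soil_emis) vanishes.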
Resumo:
In many data mining applications, automated retrieval of information from text and images is needed. This becomes essential with the growth of the Internet and digital libraries. Our approach is based on latent semantic indexing (LSI) and the corresponding term-by-document matrix suggested by Berry and his co-authors. Instead of using deterministic methods to find the required number of first "k" singular triplets, we propose a stochastic approach. First, we use a Monte Carlo method to sample and build a much smaller term-by-document matrix (e.g. a k x k matrix), from which we then find the first "k" triplets using standard deterministic methods. Second, we investigate how the problem can be reduced to finding the "k" largest eigenvalues using parallel Monte Carlo methods. We apply these methods to the initial matrix and also to the reduced one. The algorithms run on a cluster of workstations under MPI, and results of experiments on textual retrieval of Web documents, as well as a comparison of the proposed stochastic methods, are presented. (C) 2003 IMACS. Published by Elsevier Science B.V. All rights reserved.
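The deterministic LSI baseline that the stochastic methods aim to accelerate can be sketched as a truncated SVD followed by cosine ranking in the reduced space (function and variable names are illustrative):

```python
import numpy as np

def lsi_query_scores(term_doc, query_vec, k):
    """Rank documents by cosine similarity in a rank-k LSI space built
    from the first k singular triplets of the term-by-document matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T
    docs_k = Vk * sk                 # document coordinates (one row each)
    q_k = query_vec @ Uk             # query projected into the same space
    norms = np.linalg.norm(docs_k, axis=1) * np.linalg.norm(q_k)
    return docs_k @ q_k / np.where(norms == 0.0, 1.0, norms)
```

The SVD step is the expensive part for large term-by-document matrices, which is what motivates replacing it with Monte Carlo sampling of a much smaller matrix.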
Resumo:
A new Bayesian algorithm for retrieving surface rain rate from Tropical Rainfall Measuring Mission (TRMM) Microwave Imager (TMI) over the ocean is presented, along with validations against estimates from the TRMM Precipitation Radar (PR). The Bayesian approach offers a rigorous basis for optimally combining multichannel observations with prior knowledge. While other rain-rate algorithms have been published that are based at least partly on Bayesian reasoning, this is believed to be the first self-contained algorithm that fully exploits Bayes’s theorem to yield not just a single rain rate, but rather a continuous posterior probability distribution of rain rate. To advance the understanding of theoretical benefits of the Bayesian approach, sensitivity analyses have been conducted based on two synthetic datasets for which the “true” conditional and prior distributions are known. Results demonstrate that even when the prior and conditional likelihoods are specified perfectly, biased retrievals may occur at high rain rates. This bias is not the result of a defect of the Bayesian formalism, but rather represents the expected outcome when the physical constraint imposed by the radiometric observations is weak owing to saturation effects. It is also suggested that both the choice of the estimators and the prior information are crucial to the retrieval. In addition, the performance of the Bayesian algorithm herein is found to be comparable to that of other benchmark algorithms in real-world applications, while having the additional advantage of providing a complete continuous posterior probability distribution of surface rain rate.
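The core idea, a full posterior over rain rate rather than a point estimate, can be sketched on a discrete grid; the forward model below is a made-up saturating curve, not TMI radiative transfer:

```python
import numpy as np

def posterior_rain_rate(tb_obs, rain_grid, prior, forward, sigma):
    """Discrete posterior p(R | Tb) proportional to p(Tb | R) p(R), with a
    Gaussian observation likelihood around a forward-modelled Tb."""
    lik = np.exp(-0.5 * ((tb_obs - forward(rain_grid)) / sigma) ** 2)
    post = lik * prior
    return post / post.sum()

def forward(r):
    # Illustrative saturating Tb(R): the flattening at high rain rates is
    # what weakens the observational constraint there.
    return 150.0 + 130.0 * (1.0 - np.exp(-r / 8.0))

grid = np.linspace(0.0, 50.0, 501)
prior = np.exp(-grid / 10.0)
prior /= prior.sum()
post = posterior_rain_rate(230.0, grid, prior, forward, sigma=3.0)
mean_rr = float((grid * post).sum())
```

Because the likelihood flattens where the forward model saturates, the posterior there is dominated by the prior, which is exactly the mechanism the abstract identifies behind biased retrievals at high rain rates.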