932 resultados para Probabilistic latent semantic model
Resumo:
Text cohesion is an important element of discourse processing. This paper presents a new approach to modeling, quantifying, and visualizing text cohesion using automated cohesion flow indices that capture semantic links among paragraphs. Cohesion flow is calculated by applying Cohesion Network Analysis, a combination of semantic distances, Latent Semantic Analysis, and Latent Dirichlet Allocation, as well as Social Network Analysis. Experiments performed on 315 timed essays indicated that cohesion flow indices are significantly correlated with human ratings of text coherence and essay quality. Visualizations of the global cohesion indices are also included to support a more facile understanding of how cohesion flow impacts coherence in terms of semantic dependencies between paragraphs.
Resumo:
This paper introduces a novel, in-depth approach of analyzing the differences in writing style between two famous Romanian orators, based on automated textual complexity indices for Romanian language. The considered authors are: (a) Mihai Eminescu, Romania’s national poet and a remarkable journalist of his time, and (b) Ion C. Brătianu, one of the most important Romanian politicians from the middle of the 18th century. Both orators have a common journalistic interest consisting in their desire to spread the word about political issues in Romania via the printing press, the most important public voice at that time. In addition, both authors exhibit writing style particularities, and our aim is to explore these differences through our ReaderBench framework that computes a wide range of lexical and semantic textual complexity indices for Romanian and other languages. The used corpus contains two collections of speeches for each orator that cover the period 1857–1880. The results of this study highlight the lexical and cohesive textual complexity indices that reflect very well the differences in writing style, measures relying on Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) semantic models.
Resumo:
Leishmaniasis, caused by Leishmania infantum, is a vector-borne zoonotic disease that is endemic to the Mediterranean basin. The potential of rabbits and hares to serve as competent reservoirs for the disease has recently been demonstrated, although assessment of the importance of their role on disease dynamics is hampered by the absence of quantitative knowledge on the accuracy of diagnostic techniques in these species. A Bayesian latent-class model was used here to estimate the sensitivity and specificity of the Immuno-fluorescence antibody test (IFAT) in serum and a Leishmania-nested PCR (Ln-PCR) in skin for samples collected from 217 rabbits and 70 hares from two different populations in the region of Madrid, Spain. A two-population model, assuming conditional independence between test results and incorporating prior information on the performance of the tests in other animal species obtained from the literature, was used. Two alternative cut-off values were assumed for the interpretation of the IFAT results: 1/50 for conservative and 1/25 for sensitive interpretation. Results suggest that sensitivity and specificity of the IFAT were around 70–80%, whereas the Ln-PCR was highly specific (96%) but had a limited sensitivity (28.9% applying the conservative interpretation and 21.3% with the sensitive one). Prevalence was higher in the rabbit population (50.5% and 72.6%, for the conservative and sensitive interpretation, respectively) than in hares (6.7% and 13.2%). Our results demonstrate that the IFAT may be a useful screening tool for diagnosis of leishmaniasis in rabbits and hares. These results will help to design and implement surveillance programmes in wild species, with the ultimate objective of early detecting and preventing incursions of the disease into domestic and human populations.
Resumo:
Understanding how virus strains offer protection against closely related emerging strains is vital for creating effective vaccines. For many viruses, including Foot-and-Mouth Disease Virus (FMDV) and the Influenza virus where multiple serotypes often co-circulate, in vitro testing of large numbers of vaccines can be infeasible. Therefore the development of an in silico predictor of cross-protection between strains is important to help optimise vaccine choice. Vaccines will offer cross-protection against closely related strains, but not against those that are antigenically distinct. To be able to predict cross-protection we must understand the antigenic variability within a virus serotype, distinct lineages of a virus, and identify the antigenic residues and evolutionary changes that cause the variability. In this thesis we present a family of sparse hierarchical Bayesian models for detecting relevant antigenic sites in virus evolution (SABRE), as well as an extended version of the method, the extended SABRE (eSABRE) method, which better takes into account the data collection process. The SABRE methods are a family of sparse Bayesian hierarchical models that use spike and slab priors to identify sites in the viral protein which are important for the neutralisation of the virus. In this thesis we demonstrate how the SABRE methods can be used to identify antigenic residues within different serotypes and show how the SABRE method outperforms established methods, mixed-effects models based on forward variable selection or l1 regularisation, on both synthetic and viral datasets. In addition we also test a number of different versions of the SABRE method, compare conjugate and semi-conjugate prior specifications and an alternative to the spike and slab prior; the binary mask model. We also propose novel proposal mechanisms for the Markov chain Monte Carlo (MCMC) simulations, which improve mixing and convergence over that of the established component-wise Gibbs sampler. The SABRE method is then applied to datasets from FMDV and the Influenza virus in order to identify a number of known antigenic residue and to provide hypotheses of other potentially antigenic residues. We also demonstrate how the SABRE methods can be used to create accurate predictions of the important evolutionary changes of the FMDV serotypes. In this thesis we provide an extended version of the SABRE method, the eSABRE method, based on a latent variable model. The eSABRE method takes further into account the structure of the datasets for FMDV and the Influenza virus through the latent variable model and gives an improvement in the modelling of the error. We show how the eSABRE method outperforms the SABRE methods in simulation studies and propose a new information criterion for selecting the random effects factors that should be included in the eSABRE method; block integrated Widely Applicable Information Criterion (biWAIC). We demonstrate how biWAIC performs equally to two other methods for selecting the random effects factors and combine it with the eSABRE method to apply it to two large Influenza datasets. Inference in these large datasets is computationally infeasible with the SABRE methods, but as a result of the improved structure of the likelihood, we are able to show how the eSABRE method offers a computational improvement, leading it to be used on these datasets. The results of the eSABRE method show that we can use the method in a fully automatic manner to identify a large number of antigenic residues on a variety of the antigenic sites of two Influenza serotypes, as well as making predictions of a number of nearby sites that may also be antigenic and are worthy of further experiment investigation.
Resumo:
In this article, we describe a novel methodology to extract semantic characteristics from protein structures using linear algebra in order to compose structural signature vectors which may be used efficiently to compare and classify protein structures into fold families. These signatures are built from the pattern of hydrophobic intrachain interactions using Singular Value Decomposition (SVD) and Latent Semantic Indexing (LSI) techniques. Considering proteins as documents and contacts as terms, we have built a retrieval system which is able to find conserved contacts in samples of myoglobin fold family and to retrieve these proteins among proteins of varied folds with precision of up to 80%. The classifier is a web tool available at our laboratory website. Users can search for similar chains from a specific PDB, view and compare their contact maps and browse their structures using a JMol plug-in.
Resumo:
Web APIs have gained increasing popularity in recent Web service technology development owing to its simplicity of technology stack and the proliferation of mashups. However, efficiently discovering Web APIs and the relevant documentations on the Web is still a challenging task even with the best resources available on the Web. In this paper we cast the problem of detecting the Web API documentations as a text classification problem of classifying a given Web page as Web API associated or not. We propose a supervised generative topic model called feature latent Dirichlet allocation (feaLDA) which offers a generic probabilistic framework for automatic detection of Web APIs. feaLDA not only captures the correspondence between data and the associated class labels, but also provides a mechanism for incorporating side information such as labelled features automatically learned from data that can effectively help improving classification performance. Extensive experiments on our Web APIs documentation dataset shows that the feaLDA model outperforms three strong supervised baselines including naive Bayes, support vector machines, and the maximum entropy model, by over 3% in classification accuracy. In addition, feaLDA also gives superior performance when compared against other existing supervised topic models.
Resumo:
Alternative splicing of gene transcripts greatly expands the functional capacity of the genome, and certain splice isoforms may indicate specific disease states such as cancer. Splice junction microarrays interrogate thousands of splice junctions, but data analysis is difficult and error prone because of the increased complexity compared to differential gene expression analysis. We present Rank Change Detection (RCD) as a method to identify differential splicing events based upon a straightforward probabilistic model comparing the over-or underrepresentation of two or more competing isoforms. RCD has advantages over commonly used methods because it is robust to false positive errors due to nonlinear trends in microarray measurements. Further, RCD does not depend on prior knowledge of splice isoforms, yet it takes advantage of the inherent structure of mutually exclusive junctions, and it is conceptually generalizable to other types of splicing arrays or RNA-Seq. RCD specifically identifies the biologically important cases when a splice junction becomes more or less prevalent compared to other mutually exclusive junctions. The example data is from different cell lines of glioblastoma tumors assayed with Agilent microarrays.
Resumo:
Fatigue and crack propagation are phenomena affected by high uncertainties, where deterministic methods fail to predict accurately the structural life. The present work aims at coupling reliability analysis with boundary element method. The latter has been recognized as an accurate and efficient numerical technique to deal with mixed mode propagation, which is very interesting for reliability analysis. The coupled procedure allows us to consider uncertainties during the crack growth process. In addition, it computes the probability of fatigue failure for complex structural geometry and loading. Two coupling procedures are considered: direct coupling of reliability and mechanical solvers and indirect coupling by the response surface method. Numerical applications show the performance of the proposed models in lifetime assessment under uncertainties, where the direct method has shown faster convergence than response surface method. (C) 2010 Elsevier Ltd. All rights reserved.
Resumo:
We study the lysis timing of a bacteriophage population by means of a continuously infection-age-structured population dynamics model. The features of the model are the infection process of bacteria, the natural death process, and the lysis process which means the replication of bacteriophage viruses inside bacteria and the destruction of them. We consider that the length of the lysis timing (or latent period) is distributed according to a general probability distribution function. We have carried out an optimization procedure and we have found the latent period corresponding to the maximal fitness (i.e. maximal growth rate) of the bacteriophage population.
Resumo:
Dorsal and ventral pathways for syntacto-semantic speech processing in the left hemisphere are represented in the dual-stream model of auditory processing. Here we report new findings for the right dorsal and ventral temporo-frontal pathway during processing of affectively intonated speech (i.e. affective prosody) in humans, together with several left hemispheric structural connections, partly resembling those for syntacto-semantic speech processing. We investigated white matter fiber connectivity between regions responding to affective prosody in several subregions of the bilateral superior temporal cortex (secondary and higher-level auditory cortex) and of the inferior frontal cortex (anterior and posterior inferior frontal gyrus). The fiber connectivity was investigated by using probabilistic diffusion tensor based tractography. The results underscore several so far underestimated auditory pathway connections, especially for the processing of affective prosody, such as a right ventral auditory pathway. The results also suggest the existence of a dual-stream processing in the right hemisphere, and a general predominance of the dorsal pathways in both hemispheres underlying the neural processing of affective prosody in an extended temporo-frontal network.
Resumo:
Probabilistic inversion methods based on Markov chain Monte Carlo (MCMC) simulation are well suited to quantify parameter and model uncertainty of nonlinear inverse problems. Yet, application of such methods to CPU-intensive forward models can be a daunting task, particularly if the parameter space is high dimensional. Here, we present a 2-D pixel-based MCMC inversion of plane-wave electromagnetic (EM) data. Using synthetic data, we investigate how model parameter uncertainty depends on model structure constraints using different norms of the likelihood function and the model constraints, and study the added benefits of joint inversion of EM and electrical resistivity tomography (ERT) data. Our results demonstrate that model structure constraints are necessary to stabilize the MCMC inversion results of a highly discretized model. These constraints decrease model parameter uncertainty and facilitate model interpretation. A drawback is that these constraints may lead to posterior distributions that do not fully include the true underlying model, because some of its features exhibit a low sensitivity to the EM data, and hence are difficult to resolve. This problem can be partly mitigated if the plane-wave EM data is augmented with ERT observations. The hierarchical Bayesian inverse formulation introduced and used herein is able to successfully recover the probabilistic properties of the measurement data errors and a model regularization weight. Application of the proposed inversion methodology to field data from an aquifer demonstrates that the posterior mean model realization is very similar to that derived from a deterministic inversion with similar model constraints.
Resumo:
We study the lysis timing of a bacteriophage population by means of a continuously infection-age-structured population dynamics model. The features of the model are the infection process of bacteria, the death process, and the lysis process which means the replication of bacteriophage viruses inside bacteria and the destruction of them. The time till lysis (or latent period) is assumed to have an arbitrary distribution. We have carried out an optimization procedure, and we have found that the latent period corresponding to maximal fitness (i.e. maximal growth rate of the bacteriophage population) is of fixed length. We also study the dependence of the optimal latent period on the amount of susceptible bacteria and the number of virions released by a single infection. Finally, the evolutionarily stable strategy of the latent period is also determined as a fixed period taking into account that super-infections are not considered
Resumo:
The ability of thymidine kinase (tk)-deleted recombinant bovine herpesvirus 5 (BoHV-5tkΔ) to establish and reactivate latent infection was investigated in lambs. During acute infection, the recombinant virus replicated moderately in the nasal mucosa, yet to lower titers than the parental strain. At day 40 post-infection (pi), latent viral DNA was detected in trigeminal ganglia (TG) of all lambs in both groups. However, the amount of recombinant viral DNA in TGs was lower (9.7-fold less) than that of the parental virus as determined by quantitative real time PCR. Thus, tk deletion had no apparent effect on the frequency of latent infection but reduced colonization of TG. Upon dexamethasone (Dx) administration at day 40 pi, lambs inoculated with parental virus shed infectious virus in nasal secretions, contrasting with lack of infectivity in secretions of lambs inoculated with the recombinant virus. Nevertheless, some nasal swabs from the recombinant virus group were positive for viral DNA by PCR, indicating low levels of reactivation. Thus, BoHV-5 TK activity is not required for establishment of latency, but seems critical for efficient virus reactivation upon Dx treatment.