992 results for type inference
Abstract:
Background: A current challenge in gene annotation is to define gene function in the context of a network of relationships rather than in terms of single genes. The inference of gene networks (GNs) has emerged as an approach to better understand the biology of the system and to study how its components interact with each other while keeping their functions stable. In general, however, there are not enough data to accurately recover GNs from their expression levels, leading to the curse of dimensionality, in which the number of variables exceeds the number of samples. One way to mitigate this problem is to integrate additional biological data rather than using only expression profiles in the inference process. The use of several types of biological information in inference methods has increased significantly in recent years, with the goal of better recovering the connections between genes and reducing false positives. What makes this strategy attractive is the possibility of confirming known connections through the included biological data, as well as of discovering new relationships between genes when the expression data are examined. Although several works on data integration have improved the performance of network inference methods, the real contribution of each type of biological information to the obtained improvement is unclear. Methods: We propose a methodology for including biological information in an inference algorithm in order to assess the prediction gain obtained by using biological information and expression profiles together. We also evaluated and compared the gain from adding four types of biological information: (a) protein-protein interaction, (b) Rosetta stone fusion proteins, (c) KEGG and (d) KEGG+GO. Results and conclusions: This work presents a first comparison of the gain from using prior biological information in the inference of GNs for a eukaryotic organism (P. falciparum).
Our results indicate that information based on direct interaction can produce a higher gain than data about less specific relationships such as GO or KEGG. Also, as expected, the results show that the use of biological information is a very important approach for improving the inference. We also compared the gain in the inference of the global network and of only the hubs. The results indicate that the use of biological information can improve the identification of the most connected proteins.
Abstract:
This paper considers likelihood-based inference for the family of power distributions. Widely applicable results are presented which can be used to conduct inference for all three parameters of the general location-scale extension of the family. More specific results are given for the special case of the power normal model. The analysis of a large data set, formed from density measurements for a certain type of pollen, illustrates the application of the family and the results for likelihood-based inference. Throughout, comparisons are made with analogous results for the direct parametrisation of the skew-normal distribution.
Abstract:
We present a novel analysis for relating the sizes of terms and subterms occurring at different argument positions in logic predicates. We extend and enrich the concept of sized type as a representation that incorporates structural (shape) information and allows expressing both lower and upper bounds on the size of a set of terms and their subterms at any position and depth; for example, bounds on the length of lists of numbers together with bounds on the values of all of their elements. The analysis is developed using abstract interpretation, and the novel abstract operations are based on setting up and solving recurrence relations between sized types. It has been integrated, together with novel resource usage and cardinality analyses, in the abstract interpretation framework of the Ciao preprocessor, CiaoPP, in order to assess both the accuracy of the new size analysis and its usefulness in the resource usage estimation application. We show that the proposed sized types are a substantial improvement over the previous size analyses present in CiaoPP, and also benefit the resource analysis considerably, allowing the inference of bounds equal to or better than those of comparable state-of-the-art systems.
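As a purely illustrative rendering of the idea (not CiaoPP's actual representation; the class and field names are assumptions), the sketch below models a sized type for flat lists of numbers, carrying lower/upper bounds on both the list length and the element values:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SizedListType:
    """A sized type for flat lists of numbers: bounds on length and elements."""
    min_len: int
    max_len: int
    min_elem: float
    max_elem: float

    def admits(self, xs):
        """Check whether a concrete list inhabits this sized type."""
        if not (self.min_len <= len(xs) <= self.max_len):
            return False
        return all(self.min_elem <= x <= self.max_elem for x in xs)

# Lists of 2..4 numbers, each element in [0, 10].
t = SizedListType(min_len=2, max_len=4, min_elem=0.0, max_elem=10.0)
print(t.admits([1, 5, 9]))   # True: within both the length and element bounds
print(t.admits([1]))         # False: too short
print(t.admits([1, 11]))     # False: element out of range
```

A real sized type, as described in the abstract, would also nest such bounds recursively to cover subterms at any position and depth.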
Abstract:
We devise a message passing algorithm for probabilistic inference in composite systems, consisting of a large number of variables, that exhibit weak random interactions among all variables and strong interactions with a small subset of randomly chosen variables; the relative strength of the two interactions is controlled by a free parameter. We examine the performance of the algorithm numerically on a number of systems of this type for varying mixing parameter values.
Abstract:
Diffusion processes are a family of continuous-time continuous-state stochastic processes that are in general only partially observed. The joint estimation of the forcing parameters and the system noise (volatility) in these dynamical systems is a crucial, but non-trivial task, especially when the system is nonlinear and multimodal. We propose a variational treatment of diffusion processes, which allows us to compute type II maximum likelihood estimates of the parameters by simple gradient techniques and which is computationally less demanding than most MCMC approaches. We also show how a cheap estimate of the posterior over the parameters can be constructed based on the variational free energy.
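The "simple gradient techniques" mentioned above can be illustrated on a toy problem. The sketch below uses a crude Euler pseudo-likelihood for an Ornstein-Uhlenbeck diffusion rather than the paper's variational free energy, and all parameter values are made up; it recovers a drift parameter by gradient ascent from discretely observed data:

```python
import math
import random

rng = random.Random(0)
theta_true, sigma, dt, n = 1.0, 0.5, 0.01, 5000

# Simulate dx = -theta * x dt + sigma dW by Euler-Maruyama.
x = [0.5]
for _ in range(n):
    x.append(x[-1] - theta_true * x[-1] * dt
             + rng.gauss(0.0, sigma * math.sqrt(dt)))

def grad_loglik(theta):
    """Gradient of the Euler pseudo-log-likelihood w.r.t. the drift theta."""
    g = 0.0
    for t in range(n):
        resid = x[t + 1] - x[t] + theta * x[t] * dt   # one-step residual
        g -= resid * x[t] / sigma**2
    return g

# Simple gradient ascent on the pseudo-log-likelihood.
theta = 0.1
for _ in range(200):
    theta += 0.02 * grad_loglik(theta)

print(f"estimated drift: {theta:.2f} (true value {theta_true})")
```

In the partially observed, multimodal settings the paper targets, this pseudo-likelihood is unavailable, which is what motivates the variational treatment.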
Abstract:
Many modern applications fall into the category of "large-scale" statistical problems, in which both the number of observations n and the number of features or parameters p may be large. Many existing methods focus on point estimation, despite the continued relevance of uncertainty quantification in the sciences, where the number of parameters to estimate often exceeds the sample size even as the value of n grows enormously in many fields. The tendency in some areas of industry to dispense with traditional statistical analysis on the basis that "n=all" is thus of little relevance outside of certain narrow applications. The main result of the Big Data revolution in most fields has instead been to make computation much harder without reducing the importance of uncertainty quantification. Bayesian methods excel at uncertainty quantification, but often scale poorly relative to alternatives. This conflict between the statistical advantages of Bayesian procedures and their substantial computational disadvantages is perhaps the greatest challenge facing modern Bayesian statistics, and is the primary motivation for the work presented here.
Two general strategies for scaling Bayesian inference are considered. The first is the development of methods that lend themselves to faster computation, and the second is the design and characterization of computational algorithms that scale better in n or p. In the first instance, the focus is on joint inference outside of the standard problem of multivariate continuous data that has been a major focus of previous theoretical work in this area. In the second, we pursue strategies for improving the speed of Markov chain Monte Carlo algorithms and characterizing their performance in large-scale settings. Throughout, the focus is on rigorous theoretical evaluation combined with empirical demonstrations of performance and concordance with the theory.
One topic we consider is modeling the joint distribution of multivariate categorical data, often summarized in a contingency table. Contingency table analysis routinely relies on log-linear models, with latent structure analysis providing a common alternative. Latent structure models lead to a reduced rank tensor factorization of the probability mass function for multivariate categorical data, while log-linear models achieve dimensionality reduction through sparsity. Little is known about the relationship between these notions of dimensionality reduction in the two paradigms. In Chapter 2, we derive several results relating the support of a log-linear model to nonnegative ranks of the associated probability tensor. Motivated by these findings, we propose a new collapsed Tucker class of tensor decompositions, which bridge existing PARAFAC and Tucker decompositions, providing a more flexible framework for parsimoniously characterizing multivariate categorical data. Taking a Bayesian approach to inference, we illustrate empirical advantages of the new decompositions.
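The link between latent structure models and reduced-rank tensor factorizations can be made concrete with a small numerical sketch. Assuming a toy latent class model (all dimensions and distributions below are illustrative), the joint pmf of three categorical variables is exactly a rank-k nonnegative PARAFAC tensor:

```python
import itertools
import math
import random

rng = random.Random(1)

def dirichlet(dim):
    """A uniform-Dirichlet draw (probability vector of length dim)."""
    w = [rng.gammavariate(1.0, 1.0) for _ in range(dim)]
    s = sum(w)
    return [v / s for v in w]

k = 2                 # number of latent classes = tensor rank
levels = (3, 4, 2)    # category counts of three variables

weights = dirichlet(k)                                 # latent class weights
margins = [[dirichlet(d) for _ in range(k)] for d in levels]

# Joint pmf as a rank-k nonnegative PARAFAC tensor:
#   p(x1, x2, x3) = sum_h weights[h] * prod_j margins[j][h][x_j]
p = {
    cell: sum(
        weights[h] * math.prod(margins[j][h][cell[j]]
                               for j in range(len(levels)))
        for h in range(k)
    )
    for cell in itertools.product(*(range(d) for d in levels))
}

print(len(p), sum(p.values()))   # 24 cells; total mass sums to 1
```

Conditionally on the latent class, the variables are independent, which is what produces the sum-of-rank-one structure.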
Latent class models for the joint distribution of multivariate categorical data, such as the PARAFAC decomposition, play an important role in the analysis of population structure. In this context, the number of latent classes is interpreted as the number of genetically distinct subpopulations of an organism, an important factor in the analysis of evolutionary processes and conservation status. Existing methods focus on point estimates of the number of subpopulations, and lack robust uncertainty quantification. Moreover, whether the number of latent classes in these models is even an identified parameter is an open question. In Chapter 3, we show that when the model is properly specified, the correct number of subpopulations can be recovered almost surely. We then propose an alternative method for estimating the number of latent subpopulations that provides good quantification of uncertainty, and provide a simple procedure for verifying that the proposed method is consistent for the number of subpopulations. The performance of the model in estimating the number of subpopulations and other common population structure inference problems is assessed in simulations and a real data application.
In contingency table analysis, sparse data is frequently encountered for even modest numbers of variables, resulting in non-existence of maximum likelihood estimates. A common solution is to obtain regularized estimates of the parameters of a log-linear model. Bayesian methods provide a coherent approach to regularization, but are often computationally intensive. Conjugate priors ease computational demands, but the conjugate Diaconis--Ylvisaker priors for the parameters of log-linear models do not give rise to closed form credible regions, complicating posterior inference. In Chapter 4 we derive the optimal Gaussian approximation to the posterior for log-linear models with Diaconis--Ylvisaker priors, and provide convergence rate and finite-sample bounds for the Kullback-Leibler divergence between the exact posterior and the optimal Gaussian approximation. We demonstrate empirically in simulations and a real data application that the approximation is highly accurate, even in relatively small samples. The proposed approximation provides a computationally scalable and principled approach to regularized estimation and approximate Bayesian inference for log-linear models.
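The chapter's optimal Gaussian approximation is specific to log-linear models with Diaconis--Ylvisaker priors; as a generic illustration of the underlying idea of replacing a posterior with a Gaussian, the sketch below builds a Laplace approximation to a Beta density by matching its mode and curvature (all numbers are illustrative):

```python
import math

a, b = 40.0, 60.0                      # illustrative Beta posterior
mode = (a - 1) / (a + b - 2)           # posterior mode
# Curvature (negative second derivative of the log density) at the mode:
precision = (a - 1) / mode**2 + (b - 1) / (1 - mode)**2
sd = precision ** -0.5

def beta_pdf(x):
    """Exact Beta(a, b) density."""
    logc = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(logc + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def gauss_pdf(x):
    """Laplace (Gaussian) approximation matched at the mode."""
    return (math.exp(-0.5 * ((x - mode) / sd) ** 2)
            / (sd * math.sqrt(2 * math.pi)))

# The Gaussian tracks the exact density closely near its centre.
print(beta_pdf(0.4), gauss_pdf(0.4))
```

The optimal (KL-minimizing) Gaussian of the chapter is generally different from this mode-and-curvature match, but both yield closed-form credible regions.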
Another challenging and somewhat non-standard joint modeling problem is inference on tail dependence in stochastic processes. In applications where extreme dependence is of interest, data are almost always time-indexed. Existing methods for inference and modeling in this setting often cluster extreme events or choose window sizes with the goal of preserving temporal information. In Chapter 5, we propose an alternative paradigm for inference on tail dependence in stochastic processes with arbitrary temporal dependence structure in the extremes, based on the idea that the information on strength of tail dependence and the temporal structure in this dependence are both encoded in waiting times between exceedances of high thresholds. We construct a class of time-indexed stochastic processes with tail dependence obtained by endowing the support points in de Haan's spectral representation of max-stable processes with velocities and lifetimes. We extend Smith's model to these max-stable velocity processes and obtain the distribution of waiting times between extreme events at multiple locations. Motivated by this result, a new definition of tail dependence is proposed that is a function of the distribution of waiting times between threshold exceedances, and an inferential framework is constructed for estimating the strength of extremal dependence and quantifying uncertainty in this paradigm. The method is applied to climatological, financial, and electrophysiology data.
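The basic objects in this paradigm, waiting times between exceedances of a high threshold, are straightforward to compute; a minimal sketch with made-up data:

```python
# Illustrative series and threshold; real applications would use
# climatological, financial, or electrophysiology observations.
series = [0.2, 1.7, 0.4, 0.1, 2.3, 2.9, 0.8, 0.3, 3.1]
threshold = 1.5

# Times at which the series exceeds the threshold, and the waiting
# times between consecutive exceedances.
exceed_times = [t for t, x in enumerate(series) if x > threshold]
waits = [b - a for a, b in zip(exceed_times, exceed_times[1:])]

print(exceed_times)  # [1, 4, 5, 8]
print(waits)         # [3, 1, 3]
```

In the proposed framework, the distribution of these waiting times at multiple locations is what encodes both the strength of tail dependence and its temporal structure.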
The remainder of this thesis focuses on posterior computation by Markov chain Monte Carlo (MCMC), the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov transition kernel, but comparatively little attention has been paid to convergence and estimation error in these approximating Markov chains. In Chapter 6, we propose a framework for assessing when to use approximations in MCMC algorithms, and how much error in the transition kernel should be tolerated to obtain optimal estimation performance with respect to a specified loss function and computational budget. The results require only ergodicity of the exact kernel and control of the kernel approximation accuracy. The theoretical framework is applied to approximations based on random subsets of data, low-rank approximations of Gaussian processes, and a novel approximating Markov chain for discrete mixture models.
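One of the approximation families mentioned, kernels based on random subsets of data, can be sketched as a random-walk Metropolis chain whose log-likelihood is replaced by a rescaled log-likelihood over a single fixed random subset. The model, subset size, and tuning values below are all illustrative assumptions, not the thesis's construction:

```python
import math
import random

rng = random.Random(42)

# "Full" dataset: 5000 draws from N(2, 1); the target is the posterior of
# the mean under a flat prior.
data = [rng.gauss(2.0, 1.0) for _ in range(5000)]
n, m = len(data), 250

# Approximate kernel: the full log-likelihood is replaced by a rescaled
# log-likelihood over one fixed random subset of the data.
subset = rng.sample(data, m)

def approx_loglik(mu):
    return (n / m) * sum(-0.5 * (x - mu) ** 2 for x in subset)

mu, chain = 0.0, []
cur = approx_loglik(mu)
for _ in range(3000):
    prop = mu + rng.gauss(0.0, 0.05)           # random-walk proposal
    new = approx_loglik(prop)
    if math.log(max(rng.random(), 1e-300)) < new - cur:
        mu, cur = prop, new                    # accept
    chain.append(mu)

post_mean = sum(chain[1000:]) / len(chain[1000:])
print(f"approximate posterior mean: {post_mean:.2f}")
```

The framework of Chapter 6 asks how much such a kernel may deviate from the exact one before the computational savings are outweighed by the added estimation error.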
Data augmentation Gibbs samplers are arguably the most popular class of algorithm for approximately sampling from the posterior distribution for the parameters of generalized linear models. The truncated Normal and Polya-Gamma data augmentation samplers are standard examples for probit and logit links, respectively. Motivated by an important problem in quantitative advertising, in Chapter 7 we consider the application of these algorithms to modeling rare events. We show that when the sample size is large but the observed number of successes is small, these data augmentation samplers mix very slowly, with a spectral gap that converges to zero at a rate at least proportional to the reciprocal of the square root of the sample size up to a log factor. In simulation studies, moderate sample sizes result in high autocorrelations and small effective sample sizes. Similar empirical results are observed for related data augmentation samplers for multinomial logit and probit models. When applied to a real quantitative advertising dataset, the data augmentation samplers mix very poorly. Conversely, Hamiltonian Monte Carlo and a type of independence chain Metropolis algorithm show good mixing on the same dataset.
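A minimal sketch of the truncated-Normal (Albert and Chib) data augmentation Gibbs sampler for an intercept-only probit model, in the rare-event regime discussed above; the flat prior on the intercept and all sample sizes are assumptions for illustration:

```python
import math
import random
from statistics import NormalDist

std = NormalDist()
rng = random.Random(0)

def sample_truncated(mu, positive):
    """Draw z ~ N(mu, 1) truncated to z > 0 (positive) or z < 0."""
    p0 = std.cdf(-mu)                      # P(z < 0) under N(mu, 1)
    u = max(rng.random(), 1e-12)
    p = p0 + u * (1 - p0) if positive else u * p0
    return mu + std.inv_cdf(min(p, 1 - 1e-12))

# Simulated rare-event data: few successes among many trials.
n, successes = 200, 8
y = [1] * successes + [0] * (n - successes)

beta, draws = 0.0, []
for _ in range(1500):
    z = [sample_truncated(beta, yi == 1) for yi in y]   # latent utilities
    zbar = sum(z) / n
    beta = rng.gauss(zbar, 1 / math.sqrt(n))            # beta | z, flat prior
    draws.append(beta)

post_mean = sum(draws[500:]) / len(draws[500:])
print(f"posterior mean of the probit intercept: {post_mean:.2f}")
```

Tracking the autocorrelation of `draws` as `n` grows with `successes` held small exhibits the slow mixing that the chapter quantifies via the spectral gap.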
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
Abstract:
In physics, one attempts to infer the rules governing a system given only the results of imperfect measurements. Hence, microscopic theories may be effectively indistinguishable experimentally. We develop an operationally motivated procedure to identify the corresponding equivalence classes of states, and argue that the renormalization group (RG) arises from the inherent ambiguities associated with the classes: one encounters flow parameters as, e.g., a regulator, a scale, or a measure of precision, which specify representatives in a given equivalence class. This provides a unifying framework and reveals the role played by information in renormalization. We validate this idea by showing that it justifies the use of low-momenta n-point functions as statistically relevant observables around a Gaussian hypothesis. These results enable the calculation of distinguishability in quantum field theory. Our methods also provide a way to extend renormalization techniques to effective models which are not based on the usual quantum-field formalism, and elucidate the relationships between various types of RG.
Abstract:
In this study, 103 unrelated South American patients with mucopolysaccharidosis type II (MPS II) were investigated with the aim of identifying disease-causing mutations in the iduronate-2-sulfatase (IDS) gene and obtaining some insights into the genotype-phenotype correlation. The strategy used for genotyping involved the identification of the previously reported inversion/disruption of the IDS gene by PCR and screening for other mutations by PCR/SSCP. The exons with altered mobility on SSCP were sequenced, as well as all the exons of patients with no SSCP alteration. Using this strategy, we were able to find the pathogenic mutation in all patients. Alterations such as inversion/disruption and partial/total deletions of the IDS gene were found in 20/103 (19%) patients. Small insertions/deletions/indels (<22 bp) and point mutations were identified in 83/103 (81%) patients, including 30 novel mutations; except for a higher frequency of small duplications relative to small deletions, the frequencies of major and minor alterations found in our sample are in accordance with those described in the literature.
Abstract:
New DNA-based predictive tests for physical characteristics and the inference of ancestry are highly informative tools that are being increasingly used in forensic genetic analysis. Two eye colour prediction models, a Bayesian classifier (Snipper) and a multinomial logistic regression (MLR) system for the Irisplex assay, have been described for the analysis of unadmixed European populations. Since multiple SNPs in combination contribute in varying degrees to eye colour predictability in Europeans, it is likely that these predictive tests will perform differently in admixed populations with European co-ancestry than in unadmixed Europeans. In this study we examined 99 individuals from two admixed South American populations, comparing eye colour versus ancestry in order to reveal a direct correlation of light eye colour phenotypes with European co-ancestry in admixed individuals. Additionally, six eye colour prediction models, using varying numbers of SNPs and based on Snipper and MLR, were applied to the study populations. Furthermore, patterns of eye colour prediction were inferred for a set of publicly available admixed and globally distributed populations from the HGDP-CEPH panel and 1000 Genomes databases, with a special emphasis on admixed American populations similar to those of the study samples.
Abstract:
The taxonomic status of a disjunctive population of Phyllomedusa from southern Brazil was diagnosed using molecular, chromosomal, and morphological approaches, which resulted in the recognition of a new species of the P. hypochondrialis group. Here, we describe P. rustica sp. n. from the Atlantic Forest biome, found in natural highland grassland formations on a plateau in the south of Brazil. Phylogenetic inferences placed P. rustica sp. n. in a subclade that includes P. rhodei + all the highland species of the clade. Chromosomal morphology is conservative, supporting the inference of homologies among the karyotypes of the species of this genus. Phyllomedusa rustica is apparently restricted to its type-locality, and we discuss the potential impact on the strategies applied to the conservation of the natural grassland formations found within the Brazilian Atlantic Forest biome in southern Brazil. We suggest that conservation strategies should be modified to guarantee the preservation of this species.
Abstract:
Pyrimidine-5'-nucleotidase type I (P5'N-I) deficiency is an autosomal recessive condition that causes nonspherocytic hemolytic anemia, characterized by marked basophilic stippling and pyrimidine nucleotide accumulation in erythrocytes. We herein present two patients of African descent, father and daughter, with P5'N deficiency, both born from first cousins. Investigation of the promoter polymorphism of the uridine diphospho-glucuronosyl transferase 1A (UGT1A) gene revealed that the father was homozygous for the (TA7) allele and the daughter heterozygous (TA6/TA7). Sequencing of the P5'N-I gene (NT5C3) revealed a further change in homozygosity at amino acid position 56 (p.R56G), located in a highly conserved region. Both patients developed gallstones; however, the father, who had undergone surgery for the removal of stones, had extremely severe intrahepatic cholestasis, and liver biopsy revealed fibrosis and grade III siderosis, leading us to believe that the homozygosity of the UGT1A polymorphism was responsible for the more severe clinical features in the father. Moreover, our results show how the clinical expression of hemolytic anemia is influenced by epistatic factors, and we describe a new mutation in the P5'N gene associated with enzyme deficiency, iron overload, and severe gallstone formation. To our knowledge, this is the first description of P5'N deficiency in South Americans.
Abstract:
The over-production of reactive oxygen species (ROS) can cause oxidative damage to a large number of molecules, including DNA, and has been associated with the pathogenesis of several disorders, such as diabetes mellitus (DM), dyslipidemia and periodontitis (PD). We hypothesise that the presence of these diseases could proportionally increase the DNA damage. The aim of this study was to assess the micronucleus frequency (MNF), as a biomarker for DNA damage, in individuals with type 2 DM, dyslipidemia and PD. One hundred and fifty patients were divided into five groups based upon diabetic, dyslipidemic and periodontal status (Group 1 - poorly controlled DM with dyslipidemia and PD; Group 2 - well-controlled DM with dyslipidemia and PD; Group 3 - without DM, with dyslipidemia and PD; Group 4 - without DM or dyslipidemia, with PD; and Group 5 - without DM, dyslipidemia or PD). Blood analyses were carried out for fasting plasma glucose, HbA1c and lipid profile. Periodontal examinations were performed, and venous blood was collected and processed for the micronucleus (MN) assay. The frequency of micronuclei was evaluated by the cell culture cytokinesis-block MN assay. The general characteristics of each group were described by the mean and standard deviation, and the data were submitted to the Mann-Whitney, Kruskal-Wallis, multiple logistic regression and Spearman tests. Groups 1, 2 and 3 were similarly dyslipidemic, presenting increased levels of total cholesterol, low-density lipoprotein cholesterol and triglycerides. Periodontal tissue destruction and local inflammation were significantly more severe in diabetics, particularly in Group 1. The frequencies of bi-nucleated cells with MN and the MNF, as well as nucleoplasmic bridges, were significantly higher for poorly controlled diabetics with dyslipidemia and PD in comparison with those systemically healthy, even after adjusting for age and considering Bonferroni's correction.
An elevated frequency of micronuclei was found in patients affected by type 2 diabetes, dyslipidemia and PD. This result suggests that these three pathologies occurring simultaneously play an additive role in producing DNA damage. In addition, the micronucleus assay was useful as a biomarker for DNA damage in individuals with chronic degenerative diseases.
Abstract:
Leaves of Passiflora alata Curtis were characterized for their antioxidant capacity. Antioxidant analyses of DPPH, FRAP, ABTS, ORAC and phenolic compounds were carried out on three different extracts: aqueous, methanol/acetone and ethanol. Water was found to be the best solvent for the recovery of phenolic compounds and antioxidant activity, compared with methanol/acetone and ethanol. To study the anti-inflammatory properties of this extract in experimental type 1 diabetes, NOD mice were divided into two groups: the P. alata group, treated with aqueous extract of P. alata Curtis, and a non-treated control group, followed by analysis of diabetes expression. The consumption of aqueous extract and water ad libitum lasted 28 weeks. The treated group presented a decrease in diabetes incidence, a low quantity of infiltrative cells in pancreatic islets and increased glutathione in the kidney and liver (p<0.05), compared with the diabetic and non-diabetic control groups. In conclusion, our results suggest that the aqueous extract of P. alata may be considered a good source of natural antioxidants and that compounds found in its composition can act as anti-inflammatory agents, helping in the control of diabetes.