930 results for PHYLOGENETIC INFERENCE
Abstract:
Consider a model with parameter phi, and an auxiliary model with parameter theta. Let phi be randomly sampled from a given density over the known parameter space. Monte Carlo methods can be used to draw simulated data and compute the corresponding estimate of theta, say theta_tilde. A large set of tuples (phi, theta_tilde) can be generated in this manner. Nonparametric methods may be used to fit the function E(phi|theta_tilde=a), using these tuples. It is proposed to estimate phi using the fitted E(phi|theta_tilde=theta_hat), where theta_hat is the auxiliary estimate computed from the real sample data. This is a consistent and asymptotically normally distributed estimator, under certain assumptions. Monte Carlo results for dynamic panel data and vector autoregressions show that this estimator can have very attractive small sample properties. Confidence intervals can be constructed using the quantiles of the phi for which theta_tilde is close to theta_hat. Such confidence intervals are found to have very accurate coverage.
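The procedure described in this abstract can be sketched on a toy problem. The sketch below is only an illustration under stated assumptions and is not the authors' implementation: the structural model (exponential data with rate phi), the auxiliary statistic (the sample mean), the uniform sampling density for phi, the Gaussian-kernel (Nadaraya-Watson) smoother, the bandwidth, and all function names are hypothetical choices made here.

```python
import math
import random

def simulate_sample(phi, n, rng):
    """Toy structural model (an assumption): n exponential draws with rate phi."""
    return [rng.expovariate(phi) for _ in range(n)]

def auxiliary_estimate(sample):
    """Toy auxiliary estimate of theta (an assumption): the sample mean."""
    return sum(sample) / len(sample)

def fit_conditional_mean(tuples, a, bandwidth):
    """Gaussian-kernel (Nadaraya-Watson) estimate of E(phi | theta_tilde = a)."""
    num = 0.0
    den = 0.0
    for phi, theta_tilde in tuples:
        w = math.exp(-0.5 * ((theta_tilde - a) / bandwidth) ** 2)
        num += w * phi
        den += w
    if den == 0.0:
        raise ValueError("no simulated theta_tilde near a; widen the bandwidth")
    return num / den

def neighbour_quantiles(tuples, a, k, lower=0.05, upper=0.95):
    """Quantiles of phi among the k tuples whose theta_tilde is closest to a."""
    nearest = sorted(tuples, key=lambda t: abs(t[1] - a))[:k]
    phis = sorted(phi for phi, _ in nearest)
    return phis[int(lower * (k - 1))], phis[int(upper * (k - 1))]

rng = random.Random(0)
n = 400

# Step 1: draw phi from a density over the parameter space, simulate data,
# and record the tuple (phi, theta_tilde).
tuples = []
for _ in range(2000):
    phi = rng.uniform(0.5, 3.0)
    tuples.append((phi, auxiliary_estimate(simulate_sample(phi, n, rng))))

# Step 2: compute theta_hat from the "real" sample (simulated here, true phi = 2).
real_sample = simulate_sample(2.0, n, rng)
theta_hat = auxiliary_estimate(real_sample)

# Step 3: estimate phi by the fitted conditional mean at theta_hat, and take
# an interval from the quantiles of phi whose theta_tilde is close to theta_hat.
phi_hat = fit_conditional_mean(tuples, theta_hat, bandwidth=0.02)
lo, hi = neighbour_quantiles(tuples, theta_hat, k=100)
print(f"theta_hat={theta_hat:.3f}  phi_hat={phi_hat:.3f}  90% interval=({lo:.3f}, {hi:.3f})")
```

In this toy setting the auxiliary statistic is one-dimensional, so a simple kernel smoother suffices; with a higher-dimensional theta, the choice of nonparametric method and bandwidth would matter far more.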
Abstract:
Continuing developments in science and technology mean that the amount of information forensic scientists are able to provide for criminal investigations is ever increasing. The commensurate increase in complexity creates difficulties for scientists and lawyers with regard to evaluation and interpretation, notably with respect to issues of inference and decision. Probability theory, implemented through graphical methods, and specifically Bayesian networks, provides powerful methods to deal with this complexity. Extensions of these methods to elements of decision theory provide further support and assistance to the judicial system. Bayesian Networks for Probabilistic Inference and Decision Analysis in Forensic Science provides a unique and comprehensive introduction to the use of Bayesian decision networks for the evaluation and interpretation of scientific findings in forensic science, and for the support of decision-makers in their scientific and legal tasks. Includes self-contained introductions to probability and decision theory. Develops the characteristics of Bayesian networks, object-oriented Bayesian networks and their extension to decision models. Features implementation of the methodology with reference to commercial and academically available software. Presents standard networks and their extensions that can be easily implemented and that can assist in the reader's own analysis of real cases. Provides a technique for structuring problems and organizing data based on methods and principles of scientific reasoning. Contains a method for the construction of coherent and defensible arguments for the analysis and evaluation of scientific findings and for decisions based on them. Is written in a lucid style, suitable for forensic scientists and lawyers with minimal mathematical background. Includes a foreword by Ian Evett.
The clear and accessible style of this second edition makes this book ideal for all forensic scientists, applied statisticians and graduate students wishing to evaluate forensic findings from the perspective of probability and decision analysis. It will also appeal to lawyers and other scientists and professionals interested in the evaluation and interpretation of forensic findings, including decision making based on scientific information.
Abstract:
Considering genetic relatedness among species has long been argued to be an important step toward measuring biological diversity more accurately, rather than relying solely on species richness. Some researchers have correlated measures of phylogenetic diversity and species richness across a series of sites and suggest that values of phylogenetic diversity do not differ enough from those of species richness to justify their inclusion in conservation planning. We compared predictions of species richness and 10 measures of phylogenetic diversity by creating distribution models for 168 individual species of a species-rich plant family, the Cape Proteaceae. When we used average amounts of land set aside for conservation to compare areas selected on the basis of species richness with areas selected on the basis of phylogenetic diversity, correlations between species richness and different measures of phylogenetic diversity varied considerably. Correlations between species richness and measures that were based on the length of phylogenetic tree branches and tree shape were weaker than those that were based on tree shape alone. Elevation explained up to 31% of the segregation of species-rich versus phylogenetically rich areas. Given these results, the increased availability of molecular data, and the known ecological effect of phylogenetically rich communities, consideration of phylogenetic diversity in conservation decision making may be feasible and informative.
Multimodel inference and multimodel averaging in empirical modeling of occupational exposure levels.
Abstract:
Empirical modeling of exposure levels has been popular for identifying exposure determinants in occupational hygiene. Traditional data-driven methods used to choose a model on which to base inferences have typically not accounted for the uncertainty linked to the process of selecting the final model. Several new approaches propose making statistical inferences from a set of plausible models rather than from a single model regarded as 'best'. This paper introduces the multimodel averaging approach described in the monograph by Burnham and Anderson. In their approach, a set of plausible models is defined a priori by taking into account the sample size and previous knowledge of the variables that influence exposure levels. The Akaike information criterion is then calculated to evaluate the relative support of the data for each model, expressed as an Akaike weight, to be interpreted as the probability of the model being the best approximating model given the model set. The model weights can then be used to rank models, quantify the evidence favoring one over another, perform multimodel prediction, estimate the relative influence of the potential predictors and estimate multimodel-averaged effects of determinants. The whole approach is illustrated with the analysis of a data set of 1500 volatile organic compound exposure levels collected by the Institute for Work and Health (Lausanne, Switzerland) over 20 years, each concentration having been divided by the relevant Swiss occupational exposure limit and log-transformed before analysis. Multimodel inference represents a promising procedure for modeling exposure levels that incorporates the notion that several models can be supported by the data and permits evaluation, to a certain extent, of model selection uncertainty, which is seldom mentioned in current practice.
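The Akaike weights mentioned in this abstract follow the standard formula w_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2), where delta_i is the difference between the AIC of model i and the smallest AIC in the set. A minimal sketch follows; the AIC values and per-model effect estimates are made-up illustrative numbers, not results from the paper, and the function names are hypothetical.

```python
import math

def akaike_weights(aic_values):
    """Convert AIC values for a model set into Akaike weights (they sum to 1)."""
    best = min(aic_values)
    relative_likelihoods = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(relative_likelihoods)
    return [r / total for r in relative_likelihoods]

def model_averaged_estimate(estimates, weights):
    """Weighted average of per-model estimates of the same quantity."""
    return sum(w * e for w, e in zip(weights, estimates))

# Hypothetical AIC values for three candidate models (illustration only).
aic_values = [100.0, 102.0, 110.0]
weights = akaike_weights(aic_values)

# Hypothetical estimates of one determinant's effect under each model.
effect_estimates = [0.40, 0.55, 0.10]
averaged_effect = model_averaged_estimate(effect_estimates, weights)

for aic, w in zip(aic_values, weights):
    print(f"AIC={aic:6.1f}  weight={w:.4f}")
print(f"model-averaged effect: {averaged_effect:.4f}")
```

Subtracting the minimum AIC before exponentiating leaves the weights unchanged but avoids numerical underflow when AIC values are large.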
Abstract:
Given a sample from a fully specified parametric model, let Zn be a given finite-dimensional statistic - for example, an initial estimator or a set of sample moments. We propose to (re-)estimate the parameters of the model by maximizing the likelihood of Zn. We call this the maximum indirect likelihood (MIL) estimator. We also propose a computationally tractable Bayesian version of the estimator which we refer to as a Bayesian Indirect Likelihood (BIL) estimator. In most cases, the density of the statistic will be of unknown form, and we develop simulated versions of the MIL and BIL estimators. We show that the indirect likelihood estimators are consistent and asymptotically normally distributed, with the same asymptotic variance as that of the corresponding efficient two-step GMM estimator based on the same statistic. However, our likelihood-based estimators, by taking into account the full finite-sample distribution of the statistic, are higher order efficient relative to GMM-type estimators. Furthermore, in many cases they enjoy a bias reduction property similar to that of the indirect inference estimator. Monte Carlo results for a number of applications including dynamic and nonlinear panel data models, a structural auction model and two DSGE models show that the proposed estimators indeed have attractive finite sample properties.
Abstract:
Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for nonmodel organisms, potentially revolutionizing the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (i) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (ii) quantify error rates for loci, alleles and single-nucleotide polymorphisms. As an empirical example, we use a double-digest RAD data set of a nonmodel plant species, Berberis alpina, collected from high-altitude mountains in Mexico.
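The idea of using replicate pairs to quantify error can be sketched as follows. This is a simplified illustration and not the authors' pipeline or any part of Stacks: the data layout (a dict mapping locus names to a pair of alleles, or None when the locus was not recovered), the two rate definitions, and the function name are assumptions made for this example.

```python
def replicate_error_rates(rep_a, rep_b):
    """Compare two replicate genotype calls of the same individual.

    Returns two rates (definitions assumed for this sketch):
    - locus_error_rate: loci recovered in only one replicate, divided by
      loci recovered in at least one replicate.
    - allele_error_rate: loci with differing genotypes, divided by loci
      recovered in both replicates.
    """
    loci = set(rep_a) | set(rep_b)
    in_a = {locus for locus in loci if rep_a.get(locus) is not None}
    in_b = {locus for locus in loci if rep_b.get(locus) is not None}
    either = in_a | in_b
    both = in_a & in_b

    # Genotypes are unordered allele pairs, so sort before comparing.
    mismatched = sum(1 for locus in both
                     if sorted(rep_a[locus]) != sorted(rep_b[locus]))

    locus_rate = (len(either) - len(both)) / len(either) if either else 0.0
    allele_rate = mismatched / len(both) if both else 0.0
    return {"locus_error_rate": locus_rate, "allele_error_rate": allele_rate}

# Hypothetical replicate pair: identical genotypes are expected.
rep_a = {"loc1": ("A", "A"), "loc2": ("A", "G"), "loc3": ("C", "T"), "loc4": None}
rep_b = {"loc1": ("A", "A"), "loc2": ("G", "A"), "loc3": ("C", "C"), "loc4": ("T", "T")}

rates = replicate_error_rates(rep_a, rep_b)
print(rates)
```

Here loc4 is missing from one replicate and loc3 is called differently, while loc2 counts as a match because allele order is ignored.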
Abstract:
Phenological events - defined points in the life cycle of a plant or animal - have been regarded as highly plastic traits, reflecting flexible responses to various environmental cues. The ability of a species to track, via shifts in phenological events, the abiotic environment through time might dictate its vulnerability to future climate change. Understanding the predictors and drivers of phenological change is therefore critical. Here, we evaluated evidence for phylogenetic conservatism - the tendency for closely related species to share similar ecological and biological attributes - in phenological traits across flowering plants. We aggregated published and unpublished data on timing of first flower and first leaf, encompassing 4000 species at 23 sites across the Northern Hemisphere. We reconstructed the phylogeny for the set of included species, first, using the software program Phylomatic, and second, from DNA data. We then quantified phylogenetic conservatism in plant phenology within and across sites. We show that more closely related species tend to flower and leaf at similar times. By contrasting mean flowering times within and across sites, however, we illustrate that it is not the time of year that is conserved, but rather the phenological responses to a common set of abiotic cues. Our findings suggest that species cannot be treated as statistically independent when modelling phenological responses. Synthesis. Closely related species tend to resemble each other in the timing of their life-history events, a likely product of evolutionarily conserved responses to environmental cues. The search for the underlying drivers of phenology must therefore account for species' shared evolutionary histories.
Abstract:
Cloud computing has recently become very popular, and several bioinformatics applications already exist in that domain. The aim of this article is to analyse a current cloud system with respect to usability, benchmark its performance and compare its user friendliness with a conventional cluster job submission system. Given the current hype around the topic, user expectations are rather high, but current results show that neither the price/performance ratio nor the usage model is very satisfactory for large-scale embarrassingly parallel applications. However, for small to medium scale applications that require CPU time at certain peak times, the cloud is a suitable alternative.
Abstract:
The restriction fragment length polymorphism of the 195 bp repeated DNA sequence of Trypanosoma cruzi was analyzed among 23 T. cruzi stocks giving a reliable picture of the whole phylogenetic variability of the species. The profiles observed with the enzymes Hinf I and Hae III were linked together and supported the existence of two groups. Group 1 shows a 195 bp repeated unit (Hinf I) and high molecular weight DNA (Hae III), while group 2 presents a ladder profile for each enzyme, which is a characteristic of tandemly repeated DNA. The two groups, respectively, clustered stocks pertaining to the two principal lineages evidenced by isoenzyme and RAPD markers. The congruence among these three independent genomic markers corroborates the existence of two real phylogenetic lineages in T. cruzi. The specific monomorphic profiles for each major phylogenetic lineage suggest the existence of ancient sexuality and cryptic biological speciation.
Improving the performance of positive selection inference by filtering unreliable alignment regions.
Abstract:
Errors in the inferred multiple sequence alignment may lead to false prediction of positive selection. Recently, methods for detecting unreliable alignment regions were developed and were shown to accurately identify incorrectly aligned regions. While removing unreliable alignment regions is expected to increase the accuracy of positive selection inference, such filtering may also significantly decrease the power of the test, as positively selected regions are fast evolving, and those same regions are often those that are difficult to align. Here, we used realistic simulations that mimic sequence evolution of HIV-1 genes to test the hypothesis that the performance of positive selection inference using codon models can be improved by removing unreliable alignment regions. Our study shows that the benefit of removing unreliable regions exceeds the loss of power due to the removal of some of the true positively selected sites.
Abstract:
To establish the relationships of the lizard- and mammal-infecting Leishmania, we characterized the intergenic spacer region of ribosomal RNA genes from L. tarentolae and L. hoogstraali. The organization of these regions is similar to those of other eukaryotes. The intergenic spacer region was approximately 4 kb in L. tarentolae and 5.5 kb in L. hoogstraali. The size difference was due to a greater number of 63-bp repetitive elements in the latter species. This region also contained another element, repeated twice, that had an inverted octanucleotide with the potential to form a stem-loop structure that could be involved in transcription termination or processing events. The ribosomal RNA gene localization showed a distinct pattern with one chromosomal band (2.2 Mb) for L. tarentolae and two (1.5 and 1.3 Mb) for L. hoogstraali. The study also showed sequence differences in the external transcribed region that could be used to distinguish lizard Leishmania from the mammalian Leishmania. The intergenic spacer region structure features found among Leishmania species indicated that lizard and mammalian Leishmania are closely related and support the inclusion of lizard-infecting species into the subgenus Sauroleishmania proposed by Saf'janova in 1982.
Abstract:
Serine repeat antigen 5 (SERA5) is an abundant antigen of the human malaria parasite Plasmodium falciparum and is the most strongly expressed member of the nine-gene SERA family. It appears to be essential for the maintenance of the erythrocytic cycle, unlike a number of other members of this family, and has been implicated in parasite egress and/or erythrocyte invasion. All SERA proteins possess a central domain that has homology to papain except in the case of SERA5 (and some other SERAs), where the active site cysteine has been replaced with a serine. To investigate if this domain retains catalytic activity, we expressed, purified, and refolded a recombinant form of the SERA5 enzyme domain. This protein possessed chymotrypsin-like proteolytic activity as it processed substrates downstream of aromatic residues, and its activity was reversed by the serine protease inhibitor 3,4-diisocoumarin. Although all Plasmodium SERA enzyme domain sequences share considerable homology, phylogenetic studies revealed two distinct clusters across the genus, separated according to whether they possess an active site serine or cysteine. All Plasmodia appear to have at least one member of each group. Consistent with separate biological roles for members of these two clusters, molecular modeling studies revealed that SERA5 and SERA6 enzyme domains have dramatically different surface properties, although both have a characteristic papain-like fold, catalytic cleft, and an appropriately positioned catalytic triad. This study provides impetus for the examination of SERA5 as a target for antimalarial drug design.
Abstract:
Human T cell lymphotropic virus type 1 (HTLV-1) is a retrovirus that causes leukemia and the neurological disorder HTLV-1 associated myelopathy or tropical spastic paraparesis (HAM/TSP). Infection with this virus - although it is distributed worldwide - is limited to certain endemic areas of the world. Despite its specific distribution and slow mutation rate, molecular epidemiology of this virus has been useful to follow the movements of human populations and routes of virus spread to different continents. In the present study, we analyzed the genetic variability of a region of the env gene of isolates obtained from individuals of African origin who live on the Pacific coast of Colombia. Sequencing and comparison of the fragment with the same fragment from different HTLV-1 isolates showed a variability ranging from 0.8% to 1.2%. Phylogenetic studies permit us to include these isolates in the transcontinental subgroup A in which samples isolated from Brazil and Chile are also found. Further analyses will be necessary to determine if these isolates were recently introduced into the American continent or if they rather correspond to isolates introduced during the Paleolithic period.
Abstract:
Natural selection is typically exerted at some specific life stages. If natural selection takes place before a trait can be measured, using conventional models can lead to incorrect inference about population parameters. When the missing data process relates to the trait of interest, a valid inference requires explicit modeling of the missing process. We propose a joint modeling approach, a shared parameter model, to account for nonrandom missing data. It consists of an animal model for the phenotypic data and a logistic model for the missing process, linked by the additive genetic effects. A Bayesian approach is taken and inference is made using integrated nested Laplace approximations. From a simulation study we find that wrongly assuming that missing data are missing at random can result in severely biased estimates of additive genetic variance. Using real data from a wild population of Swiss barn owls Tyto alba, our model indicates that the missing individuals would display large black spots, and we conclude that genes affecting this trait are already under selection before it is expressed. Our model is a tool to correctly estimate the magnitude of both natural selection and additive genetic variance.