22 results for Inference module
at Consorci de Serveis Universitaris de Catalunya (CSUC), Spain
Abstract:
We present experimental and theoretical analyses of data requirements for haplotype inference algorithms. Our experiments include a broad range of problem sizes under two standard models of tree distribution and were designed to yield statistically robust results despite the size of the sample space. Our results validate Gusfield's conjecture that a population size of n log n is required to give (with high probability) sufficient information to deduce the n haplotypes and their complete evolutionary history. We complement these experimental findings with theoretical bounds on the population size. We also analyze the population size required to deduce some fixed fraction of the evolutionary history of a set of n haplotypes and establish linear bounds on the required sample size. These linear bounds are also shown theoretically.
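As a rough illustration of why n log n is the natural scale, the following sketch runs the classical coupon-collector experiment: drawing uniformly from a pool of n distinct haplotypes until every one has been seen at least once. The uniform-sampling pool is an assumption made here for illustration only, not the tree-based models used in the experiments above.

```python
import math
import random

def samples_until_all_seen(n, trials=200):
    """Average number of uniform draws needed to observe all n distinct
    haplotypes at least once (coupon-collector experiment)."""
    total = 0
    for _ in range(trials):
        seen, draws = set(), 0
        while len(seen) < n:
            seen.add(random.randrange(n))
            draws += 1
        total += draws
    return total / trials

for n in (10, 50, 100, 500):
    avg = samples_until_all_seen(n)
    print(f"n={n:4d}  avg draws={avg:8.1f}  n ln n={n * math.log(n):8.1f}")
```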
Abstract:
Consider a model with parameter phi, and an auxiliary model with parameter theta. Let phi be randomly sampled from a given density over the known parameter space. Monte Carlo methods can be used to draw simulated data and compute the corresponding estimate of theta, say theta_tilde. A large set of tuples (phi, theta_tilde) can be generated in this manner. Nonparametric methods may be used to fit the function E(phi | theta_tilde = a) to these tuples. It is proposed to estimate phi by the fitted E(phi | theta_tilde = theta_hat), where theta_hat is the auxiliary estimate computed from the real sample data. Under certain assumptions, this estimator is consistent and asymptotically normally distributed. Monte Carlo results for dynamic panel data and vector autoregressions show that this estimator can have very attractive small-sample properties. Confidence intervals can be constructed from the quantiles of the phi for which theta_tilde is close to theta_hat; such intervals are found to have very accurate coverage.
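A minimal sketch of this procedure, under illustrative assumptions not taken from the paper: phi is the coefficient of a plain AR(1) series (rather than a dynamic panel), theta_tilde is the OLS autoregression coefficient used as the auxiliary estimate, and E(phi | theta_tilde = a) is fitted by a simple nearest-neighbour average. The prior density, series length and neighbourhood size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 50          # series length (illustrative)
R = 20000       # number of simulated (phi, theta_tilde) pairs

def simulate_theta(phi, T, rng):
    """OLS AR(1) coefficient computed on a series simulated with true phi."""
    y = np.zeros(T)
    eps = rng.standard_normal(T)
    for t in range(1, T):
        y[t] = phi * y[t - 1] + eps[t]
    return np.sum(y[1:] * y[:-1]) / np.sum(y[:-1] ** 2)

# Step 1: draw phi from a prior density and record the auxiliary estimate.
phis = rng.uniform(0.0, 0.95, size=R)
thetas = np.array([simulate_theta(p, T, rng) for p in phis])

# Step 2: nonparametric fit of E(phi | theta_tilde = a) by k nearest neighbours,
# plus quantile-based confidence bounds from the same neighbourhood.
def fitted_phi(theta_hat, k=200):
    idx = np.argsort(np.abs(thetas - theta_hat))[:k]
    return phis[idx].mean(), np.quantile(phis[idx], [0.025, 0.975])

# Step 3: apply the fit to the auxiliary estimate from the "real" data.
theta_hat = simulate_theta(0.8, T, rng)      # stand-in for the data estimate
point, ci = fitted_phi(theta_hat)
print(f"theta_hat={theta_hat:.3f}  indirect estimate={point:.3f}  95% interval={ci}")
```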
Abstract:
Given a sample from a fully specified parametric model, let Zn be a given finite-dimensional statistic - for example, an initial estimator or a set of sample moments. We propose to (re-)estimate the parameters of the model by maximizing the likelihood of Zn. We call this the maximum indirect likelihood (MIL) estimator. We also propose a computationally tractable Bayesian version of the estimator which we refer to as a Bayesian Indirect Likelihood (BIL) estimator. In most cases, the density of the statistic will be of unknown form, and we develop simulated versions of the MIL and BIL estimators. We show that the indirect likelihood estimators are consistent and asymptotically normally distributed, with the same asymptotic variance as that of the corresponding efficient two-step GMM estimator based on the same statistic. However, our likelihood-based estimators, by taking into account the full finite-sample distribution of the statistic, are higher order efficient relative to GMM-type estimators. Furthermore, in many cases they enjoy a bias reduction property similar to that of the indirect inference estimator. Monte Carlo results for a number of applications including dynamic and nonlinear panel data models, a structural auction model and two DSGE models show that the proposed estimators indeed have attractive finite sample properties.
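A minimal sketch of the simulated indirect-likelihood idea under strong simplifying assumptions (none taken from the paper): the model is N(mu, 1), the statistic Zn is the sample mean, and its density at each candidate mu is approximated by a Gaussian kernel density estimate over simulated statistics, which is then maximized over a grid.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(1)
n, S = 30, 2000                      # sample size, simulations per candidate (illustrative)
data = rng.normal(0.7, 1.0, size=n)  # "observed" sample
z_n = data.mean()                    # the finite-dimensional statistic Zn

def simulated_log_likelihood(mu):
    """Kernel-density approximation of the density of Zn at z_n, given parameter mu."""
    sims = rng.normal(mu, 1.0, size=(S, n)).mean(axis=1)
    return np.log(gaussian_kde(sims)(z_n)[0])

grid = np.linspace(-1.0, 2.0, 121)
mil_estimate = grid[np.argmax([simulated_log_likelihood(m) for m in grid])]
print(f"Zn = {z_n:.3f}, simulated indirect-likelihood estimate of mu = {mil_estimate:.3f}")
```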
Abstract:
Low concentrations of elements in geochemical analyses have the peculiarity of being compositional data and, for a given level of significance, are likely to be beyond the capabilities of laboratories to distinguish between minute concentrations and complete absence, thus preventing laboratories from reporting extremely low concentrations of the analyte. Instead, what is reported is the detection limit, which is the minimum concentration that conclusively differentiates between presence and absence of the element. A spatially distributed exhaustive sample is employed in this study to generate unbiased sub-samples, which are further censored to observe the effect that different detection limits and sample sizes have on the inference of population distributions starting from geochemical analyses having specimens below detection limit (nondetects). The isometric logratio transformation is used to convert the compositional data in the simplex to samples in real space, thus allowing the practitioner to properly borrow from the large source of statistical techniques valid only in real space. The bootstrap method is used to numerically investigate the reliability of inferring several distributional parameters employing different forms of imputation for the censored data. The case study illustrates that, in general, best results are obtained when imputations are made using the distribution best fitting the readings above detection limit, and exposes the problems of other more widely used practices. When the sample is spatially correlated, it is necessary to combine the bootstrap with stochastic simulation.
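A minimal sketch of this kind of workflow, with invented data and a deliberately simple 2-part composition (analyte and its complement): nondetects are imputed either with half the detection limit or with draws from a lognormal fitted to the readings above it, the ilr coordinate is computed, and a bootstrap interval for its mean is formed. All distributional choices are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical concentrations and detection limit; nondetects reported as missing.
true_x = rng.lognormal(mean=-6.0, sigma=0.8, size=500)
dl = np.quantile(true_x, 0.25)
observed = np.where(true_x < dl, np.nan, true_x)

def ilr_2part(x):
    """ilr coordinate of the 2-part composition (x, 1 - x)."""
    return np.sqrt(0.5) * np.log(x / (1.0 - x))

def impute(x, dl, rule="half"):
    """Fill nondetects with dl/2, or with draws (capped at dl) from a lognormal
    crudely fitted to the readings above the detection limit."""
    out = x.copy()
    mask = np.isnan(out)
    if rule == "half":
        out[mask] = dl / 2.0
    else:
        logs = np.log(out[~mask])
        out[mask] = np.minimum(rng.lognormal(logs.mean(), logs.std(), mask.sum()), dl)
    return out

def bootstrap_mean_ilr(x, B=2000):
    """Bootstrap mean and 95% interval of the ilr coordinate."""
    z = ilr_2part(x)
    means = [rng.choice(z, size=z.size, replace=True).mean() for _ in range(B)]
    return np.mean(means), np.percentile(means, [2.5, 97.5])

for rule in ("half", "fitted"):
    est, ci = bootstrap_mean_ilr(impute(observed, dl, rule))
    print(f"{rule:6s}  mean ilr = {est:.3f}  95% interval = {ci}")
```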
Abstract:
First: a continuous-time version of Kyle's model (Kyle 1985), known as Back's model (Back 1992), of asset pricing with asymmetric information is studied. A larger class of price processes and of noise-trader processes is considered: the price process, as in Kyle's model, is allowed to depend on the path of the market order, while the noise traders' process is an inhomogeneous Lévy process. Solutions are found through the Hamilton-Jacobi-Bellman equations. When the insider is risk-neutral, the price pressure is constant and there is no equilibrium in the presence of jumps; when the insider is risk-averse, there is no equilibrium in the presence of either jumps or drifts. The case where the release time of information is unknown is also analysed, and a general relation is established between the problem of finding an equilibrium and the enlargement of filtrations. With a random announcement time the market is not fully efficient, and an equilibrium exists if the sensitivity of prices with respect to global demand decreases in time according to the distribution of that random time. Second: power variations. We consider the asymptotic behaviour of the power variation of processes of the form ∫_0^t u(s-) dS_s, where S is an alpha-stable process with index of stability 0 < alpha < 2 and the integral is an Itô integral. Stable convergence of the corresponding fluctuations is established. These results provide statistical tools to infer the process u from discrete observations. Third: a bond market is studied in which the short rate r(t) evolves as the integral of g(t-s) sigma(s) with respect to W(ds), where g and sigma are deterministic and W is a stochastic Wiener measure. Processes of this type are particular cases of ambit processes and are in general not semimartingales.
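For the second part, the realized power variation referred to can be written, for observations on the grid i/n, in the following generic form (standard notation assumed here for orientation, not necessarily the thesis's own):

```latex
% Realized power variation of Y over the grid i/n (generic notation)
V^{p}_{n}(Y)_{t} \;=\; \sum_{i=1}^{\lfloor nt \rfloor} \bigl|\, Y_{i/n} - Y_{(i-1)/n} \,\bigr|^{p},
\qquad
Y_{t} \;=\; \int_{0}^{t} u(s-)\, \mathrm{d}S_{s}.
```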
Abstract:
Background: Two genes are called synthetic lethal (SL) if mutation of either alone is not lethal, but mutation of both leads to death or a significant decrease in the organism's fitness. The detection of SL gene pairs constitutes a promising alternative for anti-cancer therapy. As cancer cells exhibit a large number of mutations, the identification of these mutated genes' SL partners may provide specific anti-cancer drug candidates, with minor perturbations to healthy cells. Since existing SL data are mainly restricted to yeast screenings, the road towards human SL candidates is limited to inference methods. Results: In the present work, we use phylogenetic analysis and database manipulation (BioGRID for interactions, Ensembl and NCBI for homology, Gene Ontology for GO attributes) in order to reconstruct the phylogenetically inferred SL gene network for human. In addition, available data on cancer-mutated genes (COSMIC and Cancer Gene Census databases) as well as on existing approved drugs (DrugBank database) support our selection of cancer-therapy candidates. Conclusions: Our work provides a complementary alternative to current methods for drug discovery and gene-target identification in anti-cancer research. Novel SL screening analyses and the use of highly curated databases would contribute to improving the results of this methodology.
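A minimal sketch of the orthology-transfer step only, with invented gene identifiers: yeast SL pairs are mapped to human gene pairs through an ortholog dictionary, and pairs containing a cancer-mutated gene are kept as candidates. The real pipeline draws on BioGRID, Ensembl/NCBI, GO and COSMIC rather than these placeholders.

```python
from itertools import product

# Hypothetical inputs (placeholders; real data would come from BioGRID, Ensembl, COSMIC).
yeast_sl_pairs = {("YGENE1", "YGENE2"), ("YGENE3", "YGENE4")}
orthologs = {"YGENE1": ["HGENEA"], "YGENE2": ["HGENEB", "HGENEC"], "YGENE3": ["HGENED"]}
cancer_mutated = {"HGENEA", "HGENED"}

def inferred_human_sl(yeast_pairs, orthologs):
    """Transfer yeast SL pairs to human gene pairs via all ortholog combinations."""
    human_pairs = set()
    for a, b in yeast_pairs:
        for ha, hb in product(orthologs.get(a, []), orthologs.get(b, [])):
            human_pairs.add(tuple(sorted((ha, hb))))
    return human_pairs

# Keep only inferred pairs in which one gene is mutated in cancer;
# the non-mutated partner is then a potential drug target.
candidates = {p for p in inferred_human_sl(yeast_sl_pairs, orthologs)
              if cancer_mutated & set(p)}
print(candidates)
```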
Abstract:
The paper presents a competence-based instructional design system and a way to provide personalized navigation through the course content. The navigation aid tool builds on the competence graph and the student model, which includes elements of uncertainty in the assessment of students. An individualized navigation graph is constructed for each student, suggesting the competences the student is most prepared to study. We use fuzzy set theory for dealing with uncertainty. The marks of the assessment tests are transformed into linguistic terms and used for assigning values to linguistic variables. For each competence, the level of difficulty and the level of knowing its prerequisites are calculated from the assessment marks. Using these linguistic variables and approximate reasoning (fuzzy IF-THEN rules), a crisp category is assigned to each competence indicating its level of recommendation.
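A minimal sketch of the fuzzy reasoning step, with invented membership functions and a two-rule base; the actual linguistic terms, rules and categories of the described system may differ.

```python
def fuzzify(mark):
    """Map a 0-10 assessment mark to memberships in three linguistic terms
    (triangular membership functions; the shapes are illustrative assumptions)."""
    low = max(0.0, min(1.0, (5.0 - mark) / 5.0))
    medium = max(0.0, 1.0 - abs(mark - 5.0) / 2.5)
    high = max(0.0, min(1.0, (mark - 5.0) / 5.0))
    return {"low": low, "medium": medium, "high": high}

def recommendation(prereq_mark, difficulty):
    """Tiny rule base: recommend a competence when its prerequisites are well
    known and its difficulty is not high (fuzzy IF-THEN, min for AND, max for OR)."""
    know = fuzzify(prereq_mark)
    strong = min(know["high"], 1.0 - difficulty)          # IF knowledge high AND difficulty low
    weak = min(know["medium"], 1.0 - difficulty / 2.0)    # IF knowledge medium AND difficulty moderate
    score = max(strong, weak)
    # Defuzzify into a crisp recommendation category.
    return "recommended" if score > 0.6 else "neutral" if score > 0.3 else "not yet"

print(recommendation(prereq_mark=9.0, difficulty=0.3))   # -> recommended
print(recommendation(prereq_mark=3.0, difficulty=0.9))   # -> not yet
```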
Abstract:
Small-sample properties are of fundamental interest when only limited data are available. Exact inference is limited by constraints imposed by specific nonrandomized tests and, of course, also by the lack of more data. These effects can be separated, as we propose to evaluate a test by comparing its type II error to the minimal type II error among all tests for the given sample. Game theory is used to establish this minimal type II error; the associated randomized test is characterized as part of a Nash equilibrium of a fictitious game against nature. We use this method to investigate sequential tests for the difference between two means when outcomes are constrained to belong to a given bounded set. Tests of inequality and of noninferiority are included. We find that inference in terms of type II error based on a balanced sample cannot be improved by sequential sampling, or even by observing counterfactual evidence, provided there is a reasonable gap between the hypotheses.
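As a point of reference, the benchmark can be stated in generic minimax notation (an assumption of this note, not necessarily the paper's exact formulation): the minimal type II error at level alpha is the value of a zero-sum game in which the statistician chooses a randomized test phi and nature chooses the distribution under the alternative,

```latex
% Minimal type II error at level \alpha (generic minimax formulation)
\beta^{*}(\alpha) \;=\;
\min_{\varphi \,:\; \sup_{P \in H_{0}} \mathbb{E}_{P}[\varphi] \,\le\, \alpha}\;\;
\sup_{Q \in H_{1}} \mathbb{E}_{Q}\bigl[\,1 - \varphi\,\bigr].
```

A given test is then judged by how far its own type II error exceeds this benchmark for the sample at hand.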
Abstract:
Public transportation is gaining importance every year, basically due to population growth, environmental policies, and route and street congestion. To enable efficient management of all the resources related to public transportation, several techniques from different areas are being applied, and several Transportation Planning Systems projects are being developed in different countries. In this work, we present the GIST Planning Transportation Systems, a Portuguese project involving two universities and six public transportation companies. We describe in detail one of the most relevant modules of this project, the crew-scheduling module. The crew-scheduling module is based on the application of metaheuristics, in particular GRASP, tabu search and genetic algorithms, to solve the bus-driver scheduling problem. The metaheuristics have been successfully incorporated in the GIST Planning Transportation Systems and are currently used by several companies.
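A minimal GRASP skeleton for a set-covering view of driver scheduling, in which duties are chosen greedily from a restricted candidate list and then pruned by a simple local search; the instance, cost model and neighbourhood are placeholders, not the GIST module's actual formulation.

```python
import random

def grasp(duties, trips, cost, iters=100, rcl_size=5, seed=0):
    """GRASP: repeat (greedy-randomized construction + local search), keep the best."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(iters):
        solution = local_search(construct(duties, trips, cost, rcl_size, rng), duties, cost)
        c = sum(cost[d] for d in solution)
        if c < best_cost:
            best, best_cost = solution, c
    return best, best_cost

def construct(duties, trips, cost, rcl_size, rng):
    """Greedy-randomized covering: pick a random duty among the cheapest per newly covered trip."""
    uncovered, solution = set(trips), []
    while uncovered:
        candidates = sorted((cost[d] / len(duties[d] & uncovered), d)
                            for d in duties if duties[d] & uncovered)
        _, chosen = rng.choice(candidates[:rcl_size])
        solution.append(chosen)
        uncovered -= duties[chosen]
    return solution

def local_search(solution, duties, cost):
    """Drop any duty whose trips are all covered by the remaining duties."""
    for d in sorted(solution, key=lambda d: -cost[d]):
        rest = set().union(*(duties[x] for x in solution if x != d)) if len(solution) > 1 else set()
        if duties[d] <= rest:
            solution.remove(d)
    return solution

# Tiny illustrative instance: candidate duties covering trips 1..4 at given costs.
duties = {"d1": {1, 2}, "d2": {2, 3}, "d3": {3, 4}, "d4": {1, 4}, "d5": {1, 2, 3}}
cost = {"d1": 3, "d2": 3, "d3": 3, "d4": 3, "d5": 5}
print(grasp(duties, {1, 2, 3, 4}, cost))
```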
Abstract:
Several estimators of the expectation, median and mode of the lognormal distribution are derived. They aim to be approximately unbiased, efficient, or to have a minimax property in the class of estimators we introduce. The small-sample properties of these estimators are assessed by simulations and, when possible, analytically. Some of these estimators of the expectation are far more efficient than the maximum likelihood or the minimum-variance unbiased estimator, even for substantial sample sizes.
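For orientation, the standard lognormal facts behind these quantities (not the paper's new estimators) are: if log X ~ N(mu, sigma^2), then

```latex
\mathrm{E}[X] = e^{\mu + \sigma^{2}/2}, \qquad
\operatorname{median}(X) = e^{\mu}, \qquad
\operatorname{mode}(X) = e^{\mu - \sigma^{2}},
```

so the plug-in estimator exp(mu_hat + sigma_hat^2/2), with mu_hat and sigma_hat^2 the sample mean and variance of the logs, is biased in small samples, which is what motivates constructing alternative estimators.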
Abstract:
This paper discusses inference in self-exciting threshold autoregressive (SETAR) models. Of main interest is inference for the threshold parameter. It is well known that the asymptotics of the corresponding estimator depend upon whether the SETAR model is continuous or not. In the continuous case, the limiting distribution is normal and standard inference is possible. In the discontinuous case, the limiting distribution is non-normal and cannot be estimated consistently. We show that valid inference can be drawn by use of the subsampling method. Moreover, the method can even be extended to situations where the (dis)continuity of the model is unknown. In this case, inference for the regression parameters of the model also becomes difficult, and subsampling can be used advantageously there as well. In addition, we consider a hypothesis test for the continuity of the SETAR model. A simulation study examines small-sample performance.
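To fix ideas, a two-regime SETAR model with one lag can be written in generic notation (the paper's setting may be more general) as

```latex
y_{t} \;=\;
\begin{cases}
\phi_{0,1} + \phi_{1,1}\, y_{t-1} + \varepsilon_{t}, & y_{t-1} \le r,\\[2pt]
\phi_{0,2} + \phi_{1,2}\, y_{t-1} + \varepsilon_{t}, & y_{t-1} > r,
\end{cases}
```

where r is the threshold parameter; the model is continuous when the two regime functions coincide at y_{t-1} = r, and it is this distinction that changes the limiting distribution of the threshold estimator.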
Abstract:
Background: Many complex systems can be represented and analysed as networks. The recent availability of large-scale datasets has made it possible to elucidate some of the organisational principles and rules that govern their function, robustness and evolution. However, one of the main limitations in using protein-protein interactions for function prediction is the availability of interaction data, especially for Mollicutes. If we could harness predicted interactions, such as those from Protein-Protein Association Networks (PPANs), by combining several protein-protein network function-inference methods with semantic similarity calculations, the use of protein-protein interactions for functional inference in this species would become potentially more useful. Results: In this work we show that using PPAN data combined with other approximations, such as functional module detection, orthology-exploitation methods and Gene Ontology (GO)-based information measures, helps to predict protein function in Mycoplasma genitalium. Conclusions: To our knowledge, the proposed method is the first that combines functional module detection among species, exploiting an orthology procedure, and using information-theory-based GO semantic similarity in PPANs of the Mycoplasma species. The results of an evaluation show a higher recall than previously reported methods that focused on only one organism network.
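A minimal sketch of one information-theoretic GO similarity, Resnik's measure (the information content of the most informative common ancestor), over a toy ontology with invented term probabilities; the measures and data actually used in the study may differ.

```python
import math

# Toy GO-like DAG: child -> parents (placeholder identifiers).
parents = {"GO:3": ["GO:1"], "GO:4": ["GO:1"], "GO:5": ["GO:3", "GO:4"], "GO:1": []}
# Toy annotation frequencies; information content is IC = -log p(term).
prob = {"GO:1": 1.0, "GO:3": 0.4, "GO:4": 0.5, "GO:5": 0.1}

def ancestors(term):
    """The term itself plus all of its ancestors in the DAG."""
    out = {term}
    for p in parents.get(term, []):
        out |= ancestors(p)
    return out

def resnik(t1, t2):
    """Information content of the most informative common ancestor of two terms."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(prob[t]) for t in common) if common else 0.0

print(resnik("GO:5", "GO:3"))   # GO:3 is itself a common ancestor -> IC = -ln 0.4
print(resnik("GO:3", "GO:4"))   # only shared ancestor is the root -> IC = 0
```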
Abstract:
With the increasing availability of various 'omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping pace with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk download.
Abstract:
A new, quantitative, inference model for environmental reconstruction (transfer function), based for the first time on the simultaneous analysis of multigroup species, has been developed. Quantitative reconstructions based on palaeoecological transfer functions provide a powerful tool for addressing questions of environmental change in a wide range of environments, from oceans to mountain lakes, and over a range of timescales, from decades to millions of years. Much progress has been made in the development of inferences based on multiple proxies, but usually these have been considered separately, and the different numeric reconstructions compared and reconciled post hoc. This paper presents a new method to combine information from multiple biological groups at the reconstruction stage. The aim of the multigroup work was to test the potential of the new approach to making improved inferences of past environmental change by improving upon current reconstruction methodologies. The taxonomic groups analysed include diatoms, chironomids and chrysophyte cysts. We test the new methodology using two cold-environment training sets, namely mountain lakes from the Pyrenees and the Alps. The use of multiple groups, as opposed to single groupings, was found to increase the reconstruction skill only slightly, as measured by the root mean square error of prediction (leave-one-out cross-validation), in the case of alkalinity, dissolved inorganic carbon and altitude (a surrogate for air temperature), but not for pH or dissolved CO2. Reasons why the improvement was less than might have been anticipated are discussed. These can include the different life-forms, environmental responses and reaction times of the groups under study.
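A minimal sketch of how the reported skill measure is typically obtained: a simple weighted-averaging transfer function is calibrated on all lakes but one, the left-out lake's environmental value is predicted from its species assemblage, and the root mean square error of prediction is taken over all left-out predictions. The toy data and the weighted-averaging choice are illustrative assumptions, not the paper's multigroup method.

```python
import numpy as np

rng = np.random.default_rng(3)
n_lakes, n_taxa = 40, 12
env = rng.uniform(5.0, 9.0, n_lakes)                 # e.g. an alkalinity-like gradient
optima_true = rng.uniform(5.0, 9.0, n_taxa)
# Toy unimodal responses: abundances peak near each taxon's environmental optimum.
abund = np.exp(-((env[:, None] - optima_true[None, :]) ** 2)) + 0.05 * rng.random((n_lakes, n_taxa))

def wa_predict(train_abund, train_env, test_abund):
    """Weighted averaging: taxon optima = abundance-weighted mean of env;
    prediction = abundance-weighted mean of those optima."""
    optima = (train_abund * train_env[:, None]).sum(0) / train_abund.sum(0)
    return (test_abund * optima).sum() / test_abund.sum()

# Leave-one-out cross-validation and RMSEP.
preds = np.array([
    wa_predict(np.delete(abund, i, 0), np.delete(env, i), abund[i])
    for i in range(n_lakes)
])
rmsep = np.sqrt(np.mean((preds - env) ** 2))
print(f"leave-one-out RMSEP = {rmsep:.3f}")
```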