155 results for LIKELIHOOD APPROACH
in University of Queensland eSpace - Australia
Abstract:
When the data consist of certain attributes measured on the same set of items in different situations, they would be described as a three-mode three-way array. A mixture likelihood approach can be implemented to cluster the items (i.e., one of the modes) on the basis of both of the other modes simultaneously (i.e., the attributes measured in different situations). In this paper, it is shown that this approach can be extended to handle three-mode three-way arrays where some of the data values are missing at random in the sense of Little and Rubin (1987). The methodology is illustrated by clustering the genotypes in a three-way soybean data set where various attributes were measured on genotypes grown in several environments.
Abstract:
Understanding the genetic architecture of quantitative traits can greatly assist the design of strategies for their manipulation in plant-breeding programs. For a number of traits, genetic variation can be the result of segregation of a few major genes and many polygenes (minor genes). The joint segregation analysis (JSA) is a maximum-likelihood approach for fitting segregation models through the simultaneous use of phenotypic information from multiple generations. Our objective in this paper was to use computer simulation to quantify the power of the JSA method for testing the mixed-inheritance model for quantitative traits when it was applied to the six basic generations: both parents (P-1 and P-2), F-1, F-2, and both backcross generations (B-1 and B-2) derived from crossing the F-1 to each parent. A total of 1968 genetic model-experiment scenarios were considered in the simulation study to quantify the power of the method. Factors that interacted to influence the power of the JSA method to correctly detect genetic models were: (1) whether there were one or two major genes in combination with polygenes, (2) the heritability of the major genes and polygenes, (3) the level of dispersion of the major genes and polygenes between the two parents, and (4) the number of individuals examined in each generation (population size). The greatest levels of power were observed for the genetic models defined with simple inheritance; e.g., the power was greater than 90% for the one major gene model, regardless of the population size and major-gene heritability. Lower levels of power were observed for the genetic models with complex inheritance (major genes and polygenes), low heritability, small population sizes and a large dispersion of favourable genes among the two parents; e.g., the power was less than 5% for the two major-gene model with a heritability value of 0.3 and population sizes of 100 individuals. 
The JSA methodology was then applied to a previously studied sorghum data set to investigate the genetic control of the putative drought-resistance trait, osmotic adjustment, in three crosses. The previous study concluded that there were two major genes segregating for osmotic adjustment in the three crosses. Application of the JSA method resulted in a change in the proposed genetic model. The presence of the two major genes was confirmed with the addition of an unspecified number of polygenes.
Abstract:
A hybrid zone between the grasshoppers Chorthippus brunneus and C. jacobsi (Orthoptera: Acrididae) in northern Spain has been analyzed for variation in morphology and ecology. These species are readily distinguished by the number of stridulatory pegs on the hind femur. Both sexes are fully winged and inhabit disturbed habitats throughout the study area. We develop a maximum-likelihood approach to fitting a two-dimensional cline to geographical variation in quantitative traits and for estimating associations of population mean with local habitat. This method reveals a cline in peg number approximately 30 km south of the Picos de Europa Mountains that shows substantial deviations in population mean compared with the expectations of simple tension zone models. The inclusion of variation in local vegetation in the model explains a significant proportion of the residual variation in peg number, indicating that habitat-genotype associations contribute to the observed spatial pattern. However, this association is weak, and a number of populations continue to show strong deviations in mean even after habitat is included in the final model. These outliers may be the result of long-distance colonization of sites distant from the cline center or may be due to a patchy pattern of initial contact during postglacial expansion. As well as contrasting with the smooth hybrid zones described for Chorthippus parallelus, this situation also contrasts with the mosaic hybrid zones observed in Gryllus crickets and in parts of the hybrid zone between Bombina toad species, where habitat-genotype associations account for substantial amounts of among-site variation.
Abstract:
We have developed an alignment-free method that calculates phylogenetic distances using a maximum-likelihood approach for a model of sequence change on patterns that are discovered in unaligned sequences. To evaluate the phylogenetic accuracy of our method, and to conduct a comprehensive comparison of existing alignment-free methods (freely available as Python package decaf+py at http://www.bioinformatics.org.au), we have created a data set of reference trees covering a wide range of phylogenetic distances. Amino acid sequences were evolved along the trees and input to the tested methods; from their calculated distances we inferred trees whose topologies we compared to the reference trees. We find our pattern-based method statistically superior to all other tested alignment-free methods. We also demonstrate the general advantage of alignment-free methods over an approach based on automated alignments when sequences violate the assumption of collinearity. Similarly, we compare methods on empirical data from an existing alignment benchmark set that we used to derive reference distances and trees. Our pattern-based approach yields distances that show a linear relationship to reference distances over a substantially longer range than other alignment-free methods. The pattern-based approach outperforms alignment-free methods and its phylogenetic accuracy is statistically indistinguishable from alignment-based distances.
Abstract:
A two-component survival mixture model is proposed to analyse a set of ischaemic stroke-specific mortality data. The survival experience of stroke patients after index stroke may be described by a subpopulation of patients in the acute condition and another subpopulation of patients in the chronic phase. To adjust for the inherent correlation of observations due to random hospital effects, a mixture model of two survival functions with random effects is formulated. Assuming a Weibull hazard in both components, an EM algorithm is developed for the estimation of fixed effect parameters and variance components. A simulation study is conducted to assess the performance of the two-component survival mixture model estimators. Simulation results confirm the applicability of the proposed model in a small sample setting. Copyright (C) 2004 John Wiley & Sons, Ltd.
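The two-component structure described in this abstract can be sketched as a weighted sum of two Weibull survival functions. This is only a minimal illustration: it omits the random hospital effects and the EM estimation, and the parameter values and the names `weibull_survival` and `mixture_survival` are assumptions made for the example, not taken from the paper.

```python
import math

def weibull_survival(t, shape, scale):
    """Weibull survival function S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

def mixture_survival(t, p_acute, acute, chronic):
    """Two-component mixture: p * S_acute(t) + (1 - p) * S_chronic(t)."""
    return (p_acute * weibull_survival(t, *acute)
            + (1.0 - p_acute) * weibull_survival(t, *chronic))

# Hypothetical (shape, scale) pairs, with time in days; illustration only.
acute = (0.8, 20.0)
chronic = (1.2, 900.0)

for t in (0, 30, 365):
    print(t, round(mixture_survival(t, 0.3, acute, chronic), 4))
```

The mixing proportion `p_acute` plays the role of the share of patients in the acute subpopulation; setting it to 1 recovers a single Weibull model.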
Abstract:
The 16S rRNA gene (16S rDNA) is currently the most widely used gene for estimating the evolutionary history of prokaryotes. To date, there are more than 30 000 16S rDNA sequences available from the core databases, GenBank, EMBL and DDBJ. This great number may cause a dilemma when composing datasets for phylogenetic analysis, since the choice and number of reference organisms are known to affect the resulting tree topology. A group of sequences appearing monophyletic in one dataset may not be so in another. This can be especially problematic when establishing the relationships of distantly related sequences at the division (phylum) level. In this study, a multiple-outgroup approach to resolving division-level phylogenetic relationships is suggested using 16S rDNA data. The approach is illustrated by two case studies concerning the monophyly of two recently proposed bacterial divisions, OP9 and OP10.
Abstract:
Binning and truncation of data are common in data analysis and machine learning. This paper addresses the problem of fitting mixture densities to multivariate binned and truncated data. The EM approach proposed by McLachlan and Jones (Biometrics, 44: 2, 571-578, 1988) for the univariate case is generalized to multivariate measurements. The multivariate solution requires the evaluation of multidimensional integrals over each bin at each iteration of the EM procedure. Naive implementation of the procedure can lead to computationally inefficient results. To reduce the computational cost a number of straightforward numerical techniques are proposed. Results on simulated data indicate that the proposed methods can achieve significant computational gains with no loss in the accuracy of the final parameter estimates. Furthermore, experimental results suggest that with a sufficient number of bins and data points it is possible to estimate the true underlying density almost as well as if the data were not binned. The paper concludes with a brief description of an application of this approach to diagnosis of iron deficiency anemia, in the context of binned and truncated bivariate measurements of volume and hemoglobin concentration from an individual's red blood cells.
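To make the binned and truncated likelihood concrete, the sketch below evaluates it for a univariate normal mixture, where each bin probability is a difference of mixture CDF values rather than a multidimensional integral. It assumes the usual multinomial form in which bin probabilities are renormalized over the observed (truncated) range; the function names and the example counts are hypothetical, and no EM fitting is performed.

```python
import math

def normal_cdf(x, mean, sd):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def mixture_cdf(x, weights, means, sds):
    """CDF of a finite normal mixture."""
    return sum(w * normal_cdf(x, m, s) for w, m, s in zip(weights, means, sds))

def binned_truncated_loglik(edges, counts, weights, means, sds):
    """Multinomial log-likelihood (up to an additive constant) of bin counts
    under a mixture truncated to the interval [edges[0], edges[-1]]."""
    bin_probs = [mixture_cdf(b, weights, means, sds) - mixture_cdf(a, weights, means, sds)
                 for a, b in zip(edges, edges[1:])]
    total = sum(bin_probs)  # probability mass inside the observed range
    return sum(n * math.log(p / total) for n, p in zip(counts, bin_probs) if n > 0)

# Hypothetical binned data: four unit-width bins on [0, 4].
edges = [0.0, 1.0, 2.0, 3.0, 4.0]
counts = [5, 20, 20, 5]
good = binned_truncated_loglik(edges, counts, [1.0], [2.0], [1.0])
bad = binned_truncated_loglik(edges, counts, [1.0], [0.0], [1.0])
print(round(good, 3), round(bad, 3))
```

In the multivariate setting of the abstract, the CDF differences would be replaced by integrals of the mixture density over each bin, which is where the computational cost arises.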
Abstract:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Abstract:
We consider a mixture model approach to the regression analysis of competing-risks data. Attention is focused on inference concerning the effects of factors on both the probability of occurrence and the hazard rate conditional on each of the failure types. These two quantities are specified in the mixture model using the logistic model and the proportional hazards model, respectively. We propose a semi-parametric mixture method to estimate the logistic and regression coefficients jointly, whereby the component-baseline hazard functions are completely unspecified. Estimation is based on maximum likelihood on the basis of the full likelihood, implemented via an expectation-conditional maximization (ECM) algorithm. Simulation studies are performed to compare the performance of the proposed semi-parametric method with a fully parametric mixture approach. The results show that when the component-baseline hazard is monotonic increasing, the semi-parametric and fully parametric mixture approaches are comparable for mildly and moderately censored samples. When the component-baseline hazard is not monotonic increasing, the semi-parametric method consistently provides less biased estimates than a fully parametric approach and is comparable in efficiency in the estimation of the parameters for all levels of censoring. The methods are illustrated using a real data set of prostate cancer patients treated with different dosages of the drug diethylstilbestrol. Copyright (C) 2003 John Wiley & Sons, Ltd.
Abstract:
In simultaneous analyses of multiple data partitions, the trees relevant when measuring support for a clade are the optimal tree, and the best tree lacking the clade (i.e., the most reasonable alternative). The parsimony-based method of partitioned branch support (PBS) forces each data set to arbitrate between the two relevant trees. This value is the amount each data set contributes to clade support in the combined analysis, and can be very different from the support apparent in separate analyses. The approach used in PBS can also be employed in likelihood: a simultaneous analysis of all data retrieves the maximum likelihood tree, and the best tree without the clade of interest is also found. Each data set is fitted to the two trees and the log-likelihood difference calculated, giving partitioned likelihood support (PLS) for each data set. These calculations can be performed regardless of the complexity of the ML model adopted. The significance of PLS can be evaluated using a variety of resampling methods, such as the Kishino-Hasegawa test, the Shimodaira-Hasegawa test, or likelihood weights, although the appropriateness and assumptions of these tests remain debated.
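The PLS calculation described here reduces to a per-partition difference of log-likelihoods on the two relevant trees. The sketch below shows that arithmetic only; the partition names and log-likelihood values are hypothetical, and in practice they would come from fitting each data set to both trees in a phylogenetics package.

```python
def partitioned_likelihood_support(lnl_ml_tree, lnl_alt_tree):
    """PLS per partition: lnL on the ML tree minus lnL on the best tree lacking the clade."""
    return {name: lnl_ml_tree[name] - lnl_alt_tree[name] for name in lnl_ml_tree}

# Hypothetical per-partition log-likelihoods; illustration only.
lnl_ml = {"mtDNA": -5230.4, "nuclear": -8120.9, "morphology": -410.2}
lnl_alt = {"mtDNA": -5238.1, "nuclear": -8119.5, "morphology": -412.0}

pls = partitioned_likelihood_support(lnl_ml, lnl_alt)
total_support = sum(pls.values())

for name, value in pls.items():
    print(name, round(value, 1))
print("total", round(total_support, 1))
```

A negative value, as for the hypothetical nuclear partition, indicates a data set that favours the alternative tree even though the combined analysis supports the clade.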
Abstract:
We consider the problem of assessing the number of clusters in a limited number of tissue samples containing gene expressions for possibly several thousands of genes. It is proposed to use a normal mixture model-based approach to the clustering of the tissue samples. One advantage of this approach is that the question on the number of clusters in the data can be formulated in terms of a test on the smallest number of components in the mixture model compatible with the data. This test can be carried out on the basis of the likelihood ratio test statistic, using resampling to assess its null distribution. The effectiveness of this approach is demonstrated on simulated data and on some microarray datasets, as considered previously in the bioinformatics literature. (C) 2004 Elsevier Inc. All rights reserved.
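The resampling idea in this abstract can be illustrated with a parametric bootstrap of the likelihood ratio statistic for one versus two components. The sketch below is a simplification under stated assumptions: it is univariate, uses a common variance for both components, starts EM from a split at the sample median, and uses only 50 bootstrap replicates; all function names are hypothetical and none of this is the authors' implementation.

```python
import math
import random

def normal_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def log_add(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def loglik_one(data):
    """Maximized log-likelihood, mean and variance for a single normal."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return sum(normal_logpdf(x, mean, var) for x in data), mean, var

def loglik_two(data, iters=100):
    """EM for a two-component normal mixture with a common variance."""
    n = len(data)
    ordered = sorted(data)
    half = n // 2
    m1 = sum(ordered[:half]) / half
    m2 = sum(ordered[half:]) / (n - half)
    var = loglik_one(data)[2]
    p = 0.5
    for _ in range(iters):
        # E-step: posterior probability that each point belongs to component 1.
        resp = []
        for x in data:
            a = math.log(p) + normal_logpdf(x, m1, var)
            b = math.log(1.0 - p) + normal_logpdf(x, m2, var)
            resp.append(math.exp(a - log_add(a, b)))
        # M-step: update mixing proportion, means and common variance.
        s = sum(resp)
        p = min(max(s / n, 1e-6), 1.0 - 1e-6)
        m1 = sum(r * x for r, x in zip(resp, data)) / max(s, 1e-12)
        m2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / max(n - s, 1e-12)
        var = sum(r * (x - m1) ** 2 + (1.0 - r) * (x - m2) ** 2
                  for r, x in zip(resp, data)) / n
    return sum(log_add(math.log(p) + normal_logpdf(x, m1, var),
                       math.log(1.0 - p) + normal_logpdf(x, m2, var))
               for x in data)

def lrt_statistic(data):
    """-2 log(lambda) for one versus two components, floored at zero."""
    return max(0.0, 2.0 * (loglik_two(data) - loglik_one(data)[0]))

def bootstrap_p_value(data, n_boot=50, seed=0):
    """Resample from the fitted one-component model to approximate the null distribution."""
    rng = random.Random(seed)
    observed = lrt_statistic(data)
    _, mean, var = loglik_one(data)
    sd = math.sqrt(var)
    exceed = 0
    for _ in range(n_boot):
        sample = [rng.gauss(mean, sd) for _ in data]
        if lrt_statistic(sample) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_boot + 1)

# Simulated data with two well-separated groups; illustration only.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(30)] + [rng.gauss(5.0, 1.0) for _ in range(30)]
stat, p_value = bootstrap_p_value(data)
print(round(stat, 2), round(p_value, 3))
```

Resampling is used because the usual chi-squared approximation for the likelihood ratio statistic does not hold when testing the number of mixture components.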
Abstract:
Background: Synovial sarcoma is a high grade sarcoma that usually occurs in adults. Numerous studies have attempted to identify prognostic factors that might allow more effective treatment for particular subgroups of patients. Methods: We studied 25 histologically confirmed cases of synovial sarcoma in an attempt to identify particular patient, tumour or treatment characteristics that might have a prognostic significance using Cox proportional hazards regression modelling to identify differences in survival rates. All patients received their definitive surgical treatment from a single orthopaedic surgeon reducing the likelihood of bias related to variations in surgical technique. Results: Statistically significant higher survival rates were seen in female patients (P = 0.040) and in patients aged
Abstract:
The schema of an information system can significantly impact the ability of end users to efficiently and effectively retrieve the information they need. Obtaining quickly the appropriate data increases the likelihood that an organization will make good decisions and respond adeptly to challenges. This research presents and validates a methodology for evaluating, ex ante, the relative desirability of alternative instantiations of a model of data. In contrast to prior research, each instantiation is based on a different formal theory. This research theorizes that the instantiation that yields the lowest weighted average query complexity for a representative sample of information requests is the most desirable instantiation for end-user queries. The theory was validated by an experiment that compared end-user performance using an instantiation of a data structure based on the relational model of data with performance using the corresponding instantiation of the data structure based on the object-relational model of data. Complexity was measured using three different Halstead metrics: program length, difficulty, and effort. For a representative sample of queries, the average complexity using each instantiation was calculated. As theorized, end users querying the instantiation with the lower average complexity made fewer semantic errors, i.e., were more effective at composing queries. (c) 2005 Elsevier B.V. All rights reserved.
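The three Halstead metrics named in this abstract can be computed from counts of operators and operands. The sketch below uses the standard Halstead definitions (length N = N1 + N2, volume V = N log2 n, difficulty D = (n1 / 2)(N2 / n2), effort E = D V); how a query is split into operators and operands is an assumption made here for illustration, and the example query is hypothetical rather than one from the study.

```python
import math

def halstead(operators, operands):
    """Halstead length, difficulty and effort from operator and operand token lists."""
    n1, n2 = len(set(operators)), len(set(operands))   # distinct operators, operands
    N1, N2 = len(operators), len(operands)             # total operators, operands
    length = N1 + N2
    volume = length * math.log2(n1 + n2)
    difficulty = (n1 / 2.0) * (N2 / n2)
    effort = difficulty * volume
    return {"length": length, "difficulty": difficulty, "effort": effort}

# Hypothetical tokenization of: SELECT name FROM emp WHERE dept = 'R&D'
operators = ["SELECT", "FROM", "WHERE", "="]
operands = ["name", "emp", "dept", "'R&D'"]

metrics = halstead(operators, operands)
print(metrics)
```

Averaging such values over a representative sample of queries, weighted as the study describes, would give one complexity figure per schema instantiation.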
Abstract:
Objective: Inpatient length of stay (LOS) is an important measure of hospital activity, health care resource consumption, and patient acuity. This research aims to develop an incremental expectation-maximization (EM) based learning approach on a mixture of experts (ME) system for on-line prediction of LOS. The use of a batch-mode learning process in most existing artificial neural networks to predict LOS is unrealistic, as the data become available over time and their patterns change dynamically. In contrast, an on-line process is capable of providing an output whenever a new datum becomes available. This on-the-spot information is therefore more useful and practical for making decisions, especially when one deals with a tremendous amount of data. Methods and material: The proposed approach is illustrated using a real example of gastroenteritis LOS data. The data set was extracted from a retrospective cohort study on all infants born in 1995-1997 and their subsequent admissions for gastroenteritis. The total number of admissions in this data set was n = 692. Linked hospitalization records of the cohort were retrieved retrospectively to derive the outcome measure, patient demographics, and associated co-morbidities information. A comparative study of the incremental learning and the batch-mode learning algorithms is considered. The performances of the learning algorithms are compared based on the mean absolute difference (MAD) between the predictions and the actual LOS, and the proportion of predictions with MAD < 1 day (Prop(MAD < 1)). The significance of the comparison is assessed through a regression analysis. Results: The incremental learning algorithm provides better on-line prediction of LOS when the system has gained sufficient training from more examples (MAD = 1.77 days and Prop(MAD < 1) = 54.3%), compared to that using the batch-mode learning.
The regression analysis indicates a significant decrease of MAD (p-value = 0.063) and a significant (p-value = 0.044) increase of Prop(MAD
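The two performance measures used in this abstract, MAD and Prop(MAD < 1), are simple to compute once predictions and actual LOS values are paired. The sketch below uses made-up numbers and hypothetical function names; it illustrates the measures only, not the incremental EM learner.

```python
def mean_absolute_difference(predicted, actual):
    """Mean absolute difference (MAD) between predicted and actual LOS, in days."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def proportion_within(predicted, actual, tolerance=1.0):
    """Proportion of predictions whose absolute difference is below the tolerance."""
    hits = sum(abs(p - a) < tolerance for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical LOS values in days; illustration only.
predicted = [2.5, 1.0, 4.0, 3.0]
actual = [2.0, 3.0, 4.5, 3.0]

mad = mean_absolute_difference(predicted, actual)
prop = proportion_within(predicted, actual)
print(mad, prop)
```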
Abstract:
Looking uphill towards house from road.