998 resultados para Applied statistics
Resumo:
Motivation: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. Results: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets.
Resumo:
This paper studies the role of income distribution and technology transfer in the process of economic development. A novel aspect of the model is that the composition of human capital as well as the level affect economic growth. Utilizing an overlapping-generations model in which income distribution changes endogenously, we present an economic explanation for why some countries could not start modern economic growth; why some countries took off but have apparently stopped growing after some time; and why some countries have successfully developed and continue to grow
Resumo:
In this paper, we consider testing for additivity in a class of nonparametric stochastic regression models. Two test statistics are constructed and their asymptotic distributions are established. We also conduct a small sample study for one of the test statistics through a simulated example. (C) 2002 Elsevier Science (USA).
Resumo:
This article develops a weighted least squares version of Levene's test of homogeneity of variance for a general design, available both for univariate and multivariate situations. When the design is balanced, the univariate and two common multivariate test statistics turn out to be proportional to the corresponding ordinary least squares test statistics obtained from an analysis of variance of the absolute values of the standardized mean-based residuals from the original analysis of the data. The constant of proportionality is simply a design-dependent multiplier (which does not necessarily tend to unity). Explicit results are presented for randomized block and Latin square designs and are illustrated for factorial treatment designs and split-plot experiments. The distribution of the univariate test statistic is close to a standard F-distribution, although it can be slightly underdispersed. For a complex design, the test assesses homogeneity of variance across blocks, treatments, or treatment factors and offers an objective interpretation of residual plot.
Resumo:
We focus on mixtures of factor analyzers from the perspective of a method for model-based density estimation from high-dimensional data, and hence for the clustering of such data. This approach enables a normal mixture model to be fitted to a sample of n data points of dimension p, where p is large relative to n. The number of free parameters is controlled through the dimension of the latent factor space. By working in this reduced space, it allows a model for each component-covariance matrix with complexity lying between that of the isotropic and full covariance structure models. We shall illustrate the use of mixtures of factor analyzers in a practical example that considers the clustering of cell lines on the basis of gene expressions from microarray experiments. (C) 2002 Elsevier Science B.V. All rights reserved.
Resumo:
We consider a mixture model approach to the regression analysis of competing-risks data. Attention is focused on inference concerning the effects of factors on both the probability of occurrence and the hazard rate conditional on each of the failure types. These two quantities are specified in the mixture model using the logistic model and the proportional hazards model, respectively. We propose a semi-parametric mixture method to estimate the logistic and regression coefficients jointly, whereby the component-baseline hazard functions are completely unspecified. Estimation is based on maximum likelihood on the basis of the full likelihood, implemented via an expectation-conditional maximization (ECM) algorithm. Simulation studies are performed to compare the performance of the proposed semi-parametric method with a fully parametric mixture approach. The results show that when the component-baseline hazard is monotonic increasing, the semi-parametric and fully parametric mixture approaches are comparable for mildly and moderately censored samples. When the component-baseline hazard is not monotonic increasing, the semi-parametric method consistently provides less biased estimates than a fully parametric approach and is comparable in efficiency in the estimation of the parameters for all levels of censoring. The methods are illustrated using a real data set of prostate cancer patients treated with different dosages of the drug diethylstilbestrol. Copyright (C) 2003 John Wiley Sons, Ltd.
Resumo:
In microarray studies, the application of clustering techniques is often used to derive meaningful insights into the data. In the past, hierarchical methods have been the primary clustering tool employed to perform this task. The hierarchical algorithms have been mainly applied heuristically to these cluster analysis problems. Further, a major limitation of these methods is their inability to determine the number of clusters. Thus there is a need for a model-based approach to these. clustering problems. To this end, McLachlan et al. [7] developed a mixture model-based algorithm (EMMIX-GENE) for the clustering of tissue samples. To further investigate the EMMIX-GENE procedure as a model-based -approach, we present a case study involving the application of EMMIX-GENE to the breast cancer data as studied recently in van 't Veer et al. [10]. Our analysis considers the problem of clustering the tissue samples on the basis of the genes which is a non-standard problem because the number of genes greatly exceed the number of tissue samples. We demonstrate how EMMIX-GENE can be useful in reducing the initial set of genes down to a more computationally manageable size. The results from this analysis also emphasise the difficulty associated with the task of separating two tissue groups on the basis of a particular subset of genes. These results also shed light on why supervised methods have such a high misallocation error rate for the breast cancer data.
Resumo:
A theta graph is a graph consisting of three pairwise internally disjoint paths with common end points. Methods for decomposing the complete graph K-nu into theta graphs with fewer than ten edges are given.
Resumo:
We describe a direct method of partitioning the 840 Steiner triple systems of order 9 into 120 large sets. The method produces partitions in which all of the large sets are isomorphic and we apply the method to each of the two non-isomorphic large sets of STS(9).
Resumo:
Regional commodity forecasts are being used increasingly in agricultural industries to enhance their risk management and decision-making processes. These commodity forecasts are probabilistic in nature and are often integrated with a seasonal climate forecast system. The climate forecast system is based on a subset of analogue years drawn from the full climatological distribution. In this study we sought to measure forecast quality for such an integrated system. We investigated the quality of a commodity (i.e. wheat and sugar) forecast based on a subset of analogue years in relation to a standard reference forecast based on the full climatological set. We derived three key dimensions of forecast quality for such probabilistic forecasts: reliability, distribution shift, and change in dispersion. A measure of reliability was required to ensure no bias in the forecast distribution. This was assessed via the slope of the reliability plot, which was derived from examination of probability levels of forecasts and associated frequencies of realizations. The other two dimensions related to changes in features of the forecast distribution relative to the reference distribution. The relationship of 13 published accuracy/skill measures to these dimensions of forecast quality was assessed using principal component analysis in case studies of commodity forecasting using seasonal climate forecasting for the wheat and sugar industries in Australia. There were two orthogonal dimensions of forecast quality: one associated with distribution shift relative to the reference distribution and the other associated with relative distribution dispersion. Although the conventional quality measures aligned with these dimensions, none measured both adequately. We conclude that a multi-dimensional approach to assessment of forecast quality is required and that simple measures of reliability, distribution shift, and change in dispersion provide a means for such assessment. The analysis presented was also relevant to measuring quality of probabilistic seasonal climate forecasting systems. The importance of retaining a focus on the probabilistic nature of the forecast and avoiding simplifying, but erroneous, distortions was discussed in relation to applying this new forecast quality assessment paradigm to seasonal climate forecasts. Copyright (K) 2003 Royal Meteorological Society.
Resumo:
Dissertação apresentada ao Instituto Politécnico do Porto para obtenção do Grau de Mestre em Logística Orientada por: Professora Doutora Patrícia Alexandra Gregório Ramos