10 results for missing data imputation

in Helda - Digital Repository of the University of Helsinki


Relevance: 80.00%

Abstract:

This thesis studies the human gene expression space using high-throughput gene expression data from DNA microarrays. In molecular biology, high-throughput techniques allow numerical measurement of the expression of tens of thousands of genes simultaneously. In a single study, such data are traditionally obtained from a limited number of sample types with a small number of replicates. For organism-wide analysis, such data have been largely unavailable, and the global structure of the human transcriptome has remained unknown. This thesis introduces a human transcriptome map of different biological entities and an analysis of its general structure. The map is constructed from gene expression data from the two largest public microarray data repositories, GEO and ArrayExpress. The creation of this map contributed to the development of ArrayExpress by identifying and retrofitting previously unusable and missing data and by improving access to its data. It also contributed to the creation of several new tools for microarray data manipulation and to the establishment of data exchange between GEO and ArrayExpress. The data integration for the global map required the creation of a large new ontology of human cell types, disease states, organism parts and cell lines. The ontology was used in a new text-mining and decision-tree-based method for automatic conversion of human-readable free-text microarray data annotations into a categorised format. Data comparability, and minimisation of the systematic measurement errors characteristic of each laboratory in this large cross-laboratory integrated dataset, were ensured by computing a range of microarray data quality metrics and excluding incomparable data. The structure of the global map of human gene expression was then explored by principal component analysis and hierarchical clustering, using heuristics and a second, purpose-built sample ontology. A preface to, and motivation for, the construction and analysis of a global map of human gene expression is given by an analysis of two microarray datasets of human malignant melanoma. The analysis of these sets incorporates an indirect comparison of statistical methods for finding differentially expressed genes and points to the need to study gene expression on a global level.
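
A minimal sketch, in Python, of the two exploratory techniques named above (principal component analysis followed by hierarchical clustering of samples); the matrix size, random data and cluster count are placeholders, and this is not the thesis pipeline.

```python
# Minimal sketch (not the thesis pipeline): PCA followed by hierarchical
# clustering of samples in an expression matrix. All data here are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2000))                 # 500 samples x 2000 genes (placeholder)

pcs = PCA(n_components=10).fit_transform(X)      # project samples onto 10 principal components

Z = linkage(pcs, method="ward")                  # agglomerative clustering in PC space
labels = fcluster(Z, t=5, criterion="maxclust")  # cut the dendrogram into 5 clusters
print(labels[:20])
```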

Relevance: 40.00%

Abstract:

Whether a statistician wants to complement a probability model for observed data with a prior distribution and carry out fully probabilistic inference, or to base the inference only on the likelihood function, may be a fundamental question in theory; in practice it may well be of less importance if the likelihood contains much more information than the prior. Maximum likelihood inference can be justified as a Gaussian approximation at the posterior mode, using flat priors. However, in situations where the parametric assumptions of standard statistical models would be too rigid, more flexible model formulation, combined with fully probabilistic inference, can be achieved using hierarchical Bayesian parametrization. This work includes five articles, all of which apply probability modeling to problems involving incomplete observation. Three of the papers apply maximum likelihood estimation and two apply hierarchical Bayesian modeling. Because maximum likelihood can be presented as a special case of Bayesian inference, but not the other way round, the introductory part of this work presents a framework for probability-based inference using only Bayesian concepts. We also re-derive some results presented in the original articles using the tools developed here, to show that they are also justifiable under this more general framework. The assumption of exchangeability and de Finetti's representation theorem are applied repeatedly to justify the use of standard parametric probability models with conditionally independent likelihood contributions. It is argued that the same reasoning applies also under sampling from a finite population. The main emphasis is on probability-based inference under incomplete observation due to study design, illustrated using a generic two-phase cohort sampling design as an example. The alternative approaches presented for the analysis of such a design are full likelihood, which utilizes all observed information, and conditional likelihood, which is restricted to a completely observed set, conditioning on the rule that generated that set. Conditional likelihood inference is also applied to a joint analysis of prevalence and incidence data, a situation subject to both left censoring and left truncation. Other topics covered are model uncertainty and causal inference using posterior predictive distributions. We formulate a non-parametric monotonic regression model for one or more covariates, together with a Bayesian estimation procedure, and apply the model in the context of optimal sequential treatment regimes, demonstrating that inference based on posterior predictive distributions is feasible also in this case.
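
As a concrete illustration of the opening remark (maximum likelihood as the posterior mode under a flat prior), the sketch below uses a simple binomial model with hypothetical counts; it does not correspond to any model in the articles.

```python
# With a flat prior, the log-posterior is the log-likelihood plus a constant,
# so the posterior mode equals the maximum likelihood estimate.
from scipy.optimize import minimize_scalar
from scipy.stats import binom

k, n = 7, 20  # hypothetical data: k successes in n trials

# Numerical maximum likelihood estimate of the success probability
mle = minimize_scalar(lambda p: -binom.logpmf(k, n, p),
                      bounds=(1e-6, 1 - 1e-6), method="bounded").x

# Posterior mode under a flat Beta(1, 1) prior: Beta(k + 1, n - k + 1) has mode k/n
posterior_mode = k / n
print(mle, posterior_mode)  # both approximately 0.35
```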

Relevance: 30.00%

Abstract:

Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and devise solutions to problems. Nowadays large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we develop efficient algorithms for a commonly occurring search problem - searching for the statistically most significant dependency rules in binary data. We consider dependency rules of the form X->A or X->not A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequent A. Classical examples are genetic and environmental factors, which can either cause or prevent a disease. The emphasis in this research is that the discovered dependencies should be genuine - i.e. they should also hold in future data. This is an important distinction from traditional association rules, which - in spite of their name and a similar appearance to dependency rules - do not necessarily represent statistical dependencies at all, or represent only spurious connections that occur by chance. Therefore, the principal objective is to search for rules using statistical significance measures. Another important objective is to search only for non-redundant rules, which express the real causes of the dependence without any incidental extra factors. The extra factors add no new information on the dependence; they can only blur it and make it less accurate in future data. The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependency nor statistical significance is a monotonic property, which means that traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measure. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measure, such as Fisher's exact test, the chi-squared measure, mutual information, and z scores. According to our experiments, the algorithm scales well, especially with Fisher's exact test; it can easily handle even the densest data sets with 10,000-20,000 attributes. Still, the results are globally optimal, which is a remarkable improvement over existing solutions. In practice, this means that the user does not have to worry about whether the dependencies hold in future data or whether the data still contains better, but undiscovered, dependencies.
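
To make the rule-scoring idea concrete, the sketch below evaluates one candidate rule X -> A with Fisher's exact test on synthetic binary data; the data are hypothetical, and the search and pruning algorithm of the thesis is not reproduced here.

```python
# Scoring a single dependency rule X -> A with Fisher's exact test on binary data.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
data = rng.integers(0, 2, size=(1000, 3))  # rows = observations, columns = binary attributes
X = data[:, 0] & data[:, 1]                # antecedent: both attributes positive
A = data[:, 2]                             # consequent attribute

# 2x2 contingency table of X against A
table = [[np.sum((X == 1) & (A == 1)), np.sum((X == 1) & (A == 0))],
         [np.sum((X == 0) & (A == 1)), np.sum((X == 0) & (A == 0))]]

odds_ratio, p_value = fisher_exact(table, alternative="greater")  # test for a positive dependency
print(odds_ratio, p_value)
```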

Relevance: 30.00%

Abstract:

We report a search for single top quark production with the CDF II detector using 2.1 fb^-1 of integrated luminosity of p-pbar collisions at sqrt(s) = 1.96 TeV. The selected data consist of events characterized by a large energy imbalance in the transverse plane and hadronic jets, with no identified electrons or muons, so the sample is enriched in W -> tau nu decays. In order to suppress backgrounds, additional kinematic and topological requirements are imposed through a neural network, and at least one of the jets must be identified as a b-quark jet. We measure an excess of signal-like events in agreement with the standard model prediction, but inconsistent with a model without single top quark production by 2.1 standard deviations (sigma), with a median expected sensitivity of 1.4 sigma. Assuming a top quark mass of 175 GeV/c^2 and ascribing the excess to single top quark production, the cross section is measured to be 4.9 +2.5/-2.2 (stat+syst) pb, consistent with measurements performed in independent datasets and with the standard model prediction.
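
For a rough feel for the quoted significances, the two-line sketch below converts 2.1 sigma (observed) and 1.4 sigma (expected sensitivity) into one-sided Gaussian tail probabilities; this is only an illustrative translation, not the statistical treatment used in the CDF analysis.

```python
# One-sided Gaussian tail probabilities for the quoted significances.
from scipy.stats import norm

print(norm.sf(2.1))  # observed excess: p ~ 0.018
print(norm.sf(1.4))  # median expected sensitivity: p ~ 0.081
```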

Relevance: 30.00%

Abstract:

Using data from 2.9  fb-1 of integrated luminosity collected with the CDF II detector at the Tevatron, we search for resonances decaying into a pair of on-shell gauge bosons, WW or WZ, where one W decays into an electron and a neutrino, and the other boson decays into two jets. We observed no statistically significant excess above the expected standard model background, and we set cross section limits at 95% confidence level on G* (Randall-Sundrum graviton), Z′, and W′ bosons. By comparing these limits to theoretical cross sections, mass exclusion regions for the three particles are derived. The mass exclusion regions for Z′ and W′ are further evaluated as a function of their gauge coupling strength.
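
The final step described above (turning cross-section limits into mass exclusion regions) amounts to excluding any mass point at which the predicted cross section exceeds the observed limit; the toy sketch below illustrates that comparison with invented numbers, not CDF results.

```python
# Toy illustration: a mass point is excluded when the theoretical cross section
# exceeds the observed 95% confidence level limit. All numbers are invented.
import numpy as np

masses = np.array([200, 300, 400, 500, 600])   # hypothetical mass grid, GeV/c^2
limits = np.array([10.0, 5.0, 3.0, 2.0, 1.5])  # observed 95% C.L. limits, pb
theory = np.array([50.0, 12.0, 4.0, 1.5, 0.6]) # model cross sections, pb

print(masses[theory > limits])                 # excluded mass points
```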

Relevance: 30.00%

Abstract:

Using data from 2.9/fb of integrated luminosity collected with the CDF II detector at the Tevatron, we search for resonances decaying into a pair of on-shell gauge bosons, WW or WZ, where one W decays into an electron and a neutrino, and the other boson decays into two jets. We observed no statistically significant excess above the expected standard model background, and we set cross section limits at 95% confidence level on G* (Randall-Sundrum graviton), Z′, and W′ bosons. By comparing these limits to theoretical cross sections, mass exclusion regions for the three particles are derived. The mass exclusion regions for Z′ and W′ are further evaluated as a function of their gauge coupling strength.

Relevance: 30.00%

Abstract:

We report on a search for the standard model Higgs boson produced in association with a W or Z boson in p-pbar collisions at sqrt(s) = 1.96 TeV, recorded by the CDF II experiment at the Tevatron in a data sample corresponding to an integrated luminosity of 2.1 fb^-1. We consider events which have no identified charged leptons, an imbalance in transverse momentum, and two or three jets, where at least one jet is consistent with originating from the decay of a b hadron. We find good agreement between data and predictions. We place 95% confidence level upper limits on the production cross section for several Higgs boson masses ranging from 110 GeV/c^2 to 150 GeV/c^2. For a mass of 115 GeV/c^2 the observed (expected) limit is 6.9 (5.6) times the standard model prediction.

Relevance: 30.00%

Abstract:

We present a signature-based search for anomalous production of events containing a photon, two jets (of which at least one is identified as originating from a b quark), and missing transverse energy. The search uses data corresponding to 2.0/fb of integrated luminosity from p-pbar collisions at a center-of-mass energy of sqrt(s) = 1.96 TeV, collected with the CDF II detector at the Fermilab Tevatron. From 6,697,466 events with a photon candidate with transverse energy ET > 25 GeV, we find 617 events with missing transverse energy > 25 GeV and two or more jets with ET > 15 GeV, at least one identified as originating from a b quark, versus an expectation of 607 +- 113 events. Increasing the requirement on missing transverse energy to 50 GeV, we find 28 events versus an expectation of 30 +- 11 events. We find no indications of non-standard-model phenomena.
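
As a quick restatement of the quoted yields, the deviation of the observed 617 events from the expected 607 +- 113 is well below one standard deviation, consistent with the stated conclusion; the snippet below is only that arithmetic.

```python
# Deviation of the observed yield from expectation, in units of the quoted uncertainty.
observed, expected, uncertainty = 617, 607, 113
print((observed - expected) / uncertainty)  # about 0.09 standard deviations
```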

Relevance: 30.00%

Abstract:

We present results of a signature-based search for new physics using a dijet plus missing transverse energy data sample collected in 2 fb-1 of p-pbar collisions at sqrt(s) = 1.96 TeV with the CDF II detector at the Fermilab Tevatron. We observe no significant event excess with respect to the standard model prediction and extract a 95% C.L. upper limit on the cross section times acceptance for a potential contribution from a non-standard model process. Based on this limit the mass of a first or second generation scalar leptoquark is constrained to be above 187 GeV/c^2.

Relevance: 30.00%

Abstract:

In recent years, thanks to developments in information technology, large-dimensional datasets have become increasingly available. Researchers now have access to thousands of economic series, and the information contained in them can be used to create accurate forecasts and to test economic theories. To exploit this large amount of information, researchers and policymakers need an appropriate econometric model. Usual time series models, for example vector autoregressions, cannot incorporate more than a few variables. There are two ways to solve this problem: use variable selection procedures, or gather the information contained in the series into an index model. This thesis focuses on one of the most widespread index models, the dynamic factor model (the theory behind this model, based on the previous literature, is the core of the first part of this study), and on its use in forecasting Finnish macroeconomic indicators (the focus of the second part of the thesis). In particular, I forecast economic activity indicators (e.g. GDP) and price indicators (e.g. the consumer price index) from three large Finnish datasets. The first dataset contains a large set of aggregated series obtained from the Statistics Finland database. The second dataset is composed of economic indicators from the Bank of Finland. The last dataset is formed by disaggregated data from Statistics Finland, which I call the micro dataset. The forecasts are computed following a two-step procedure: in the first step, I estimate a set of common factors from the original dataset; the second step consists of formulating forecasting equations that include the factors extracted in the first step. The predictions are evaluated using the relative mean squared forecast error, where the benchmark model is a univariate autoregressive model. The results are dataset-dependent. The forecasts based on factor models are very accurate for the first dataset (the Statistics Finland one), while they are considerably worse for the Bank of Finland dataset. The forecasts derived from the micro dataset are still good, but less accurate than the ones obtained in the first case. This work leads to multiple research developments. The results obtained here can be replicated for longer datasets. The non-aggregated data can be represented in an even more disaggregated form (firm level). Finally, the use of micro data, one of the major contributions of this thesis, can be useful for the imputation of missing values and the creation of flash estimates of macroeconomic indicators (nowcasting).
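
A minimal sketch of the two-step procedure described above, on synthetic data: factors are extracted by principal components and then used in a factor-augmented forecasting regression, compared with a univariate autoregressive benchmark via the relative mean squared forecast error. The series, panel size and evaluation split are placeholders, not the Finnish datasets used in the thesis.

```python
# Two-step factor-model forecast on synthetic data (placeholder for the thesis setup).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
T, N, r = 200, 100, 3
F = rng.normal(size=(T, r))                           # latent common factors
panel = F @ rng.normal(size=(r, N)) + rng.normal(scale=0.5, size=(T, N))
y = F[:, 0] + rng.normal(scale=0.5, size=T)           # target series driven by the first factor

# Step 1: estimate common factors from the panel by principal components.
factors = PCA(n_components=r).fit_transform(panel)

split = 150  # last 50 observations held out for forecast evaluation

def msfe(predictors):
    """One-step-ahead out-of-sample mean squared forecast error for y."""
    model = LinearRegression().fit(predictors[:split - 1], y[1:split])
    forecasts = model.predict(predictors[split - 1:T - 1])
    return np.mean((y[split:] - forecasts) ** 2)

# Step 2: factor-augmented regression versus a univariate autoregressive benchmark.
factor_model = msfe(np.column_stack([y[:, None], factors]))
ar_benchmark = msfe(y[:, None])
print(factor_model / ar_benchmark)  # relative MSFE; values below 1 favour the factor model
```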