868 resultados para data analysis software
Resumo:
Software repositories have been getting a lot of attention from researchers in recent years. In order to analyze software repositories, it is necessary to first extract raw data from the version control and problem tracking systems. This poses two challenges: (1) extraction requires a non-trivial effort, and (2) the results depend on the heuristics used during extraction. These challenges burden researchers that are new to the community and make it difficult to benchmark software repository mining since it is almost impossible to reproduce experiments done by another team. In this paper we present the TA-RE corpus. TA-RE collects extracted data from software repositories in order to build a collection of projects that will simplify extraction process. Additionally the collection can be used for benchmarking. As the first step we propose an exchange language capable of making sharing and reusing data as simple as possible.
Resumo:
Estimation for bivariate right censored data is a problem that has had much study over the past 15 years. In this paper we propose a new class of estimators for the bivariate survival function based on locally efficient estimation. We introduce the locally efficient estimator for bivariate right censored data, present an asymptotic theorem, present the results of simulation studies and perform a brief data analysis illustrating the use of the locally efficient estimator.
Resumo:
We propose robust and e±cient tests and estimators for gene-environment/gene-drug interactions in family-based association studies. The methodology is designed for studies in which haplotypes, quantitative pheno- types and complex exposure/treatment variables are analyzed. Using causal inference methodology, we derive family-based association tests and estimators for the genetic main effects and the interactions. The tests and estimators are robust against population admixture and strati¯cation without requiring adjustment for confounding variables. We illustrate the practical relevance of our approach by an application to a COPD study. The data analysis suggests a gene-environment interaction between a SNP in the Serpine gene and smok- ing status/pack years of smoking that reduces the FEV1 volume by about 0.02 liter per pack year of smoking. Simulation studies show that the pro- posed methodology is su±ciently powered for realistic sample sizes and that it provides valid tests and effect size estimators in the presence of admixture and stratification.
Resumo:
Background: The recent development of semi-automated techniques for staining and analyzing flow cytometry samples has presented new challenges. Quality control and quality assessment are critical when developing new high throughput technologies and their associated information services. Our experience suggests that significant bottlenecks remain in the development of high throughput flow cytometry methods for data analysis and display. Especially, data quality control and quality assessment are crucial steps in processing and analyzing high throughput flow cytometry data. Methods: We propose a variety of graphical exploratory data analytic tools for exploring ungated flow cytometry data. We have implemented a number of specialized functions and methods in the Bioconductor package rflowcyt. We demonstrate the use of these approaches by investigating two independent sets of high throughput flow cytometry data. Results: We found that graphical representations can reveal substantial non-biological differences in samples. Empirical Cumulative Distribution Function and summary scatterplots were especially useful in the rapid identification of problems not identified by manual review. Conclusions: Graphical exploratory data analytic tools are quick and useful means of assessing data quality. We propose that the described visualizations should be used as quality assessment tools and where possible, be used for quality control.
Resumo:
Genomic alterations have been linked to the development and progression of cancer. The technique of Comparative Genomic Hybridization (CGH) yields data consisting of fluorescence intensity ratios of test and reference DNA samples. The intensity ratios provide information about the number of copies in DNA. Practical issues such as the contamination of tumor cells in tissue specimens and normalization errors necessitate the use of statistics for learning about the genomic alterations from array-CGH data. As increasing amounts of array CGH data become available, there is a growing need for automated algorithms for characterizing genomic profiles. Specifically, there is a need for algorithms that can identify gains and losses in the number of copies based on statistical considerations, rather than merely detect trends in the data. We adopt a Bayesian approach, relying on the hidden Markov model to account for the inherent dependence in the intensity ratios. Posterior inferences are made about gains and losses in copy number. Localized amplifications (associated with oncogene mutations) and deletions (associated with mutations of tumor suppressors) are identified using posterior probabilities. Global trends such as extended regions of altered copy number are detected. Since the posterior distribution is analytically intractable, we implement a Metropolis-within-Gibbs algorithm for efficient simulation-based inference. Publicly available data on pancreatic adenocarcinoma, glioblastoma multiforme and breast cancer are analyzed, and comparisons are made with some widely-used algorithms to illustrate the reliability and success of the technique.
Resumo:
Visualization and exploratory analysis is an important part of any data analysis and is made more challenging when the data are voluminous and high-dimensional. One such example is environmental monitoring data, which are often collected over time and at multiple locations, resulting in a geographically indexed multivariate time series. Financial data, although not necessarily containing a geographic component, present another source of high-volume multivariate time series data. We present the mvtsplot function which provides a method for visualizing multivariate time series data. We outline the basic design concepts and provide some examples of its usage by applying it to a database of ambient air pollution measurements in the United States and to a hypothetical portfolio of stocks.
Resumo:
Many seemingly disparate approaches for marginal modeling have been developed in recent years. We demonstrate that many current approaches for marginal modeling of correlated binary outcomes produce likelihoods that are equivalent to the proposed copula-based models herein. These general copula models of underlying latent threshold random variables yield likelihood based models for marginal fixed effects estimation and interpretation in the analysis of correlated binary data. Moreover, we propose a nomenclature and set of model relationships that substantially elucidates the complex area of marginalized models for binary data. A diverse collection of didactic mathematical and numerical examples are given to illustrate concepts.
Resumo:
This paper proposes Poisson log-linear multilevel models to investigate population variability in sleep state transition rates. We specifically propose a Bayesian Poisson regression model that is more flexible, scalable to larger studies, and easily fit than other attempts in the literature. We further use hierarchical random effects to account for pairings of individuals and repeated measures within those individuals, as comparing diseased to non-diseased subjects while minimizing bias is of epidemiologic importance. We estimate essentially non-parametric piecewise constant hazards and smooth them, and allow for time varying covariates and segment of the night comparisons. The Bayesian Poisson regression is justified through a re-derivation of a classical algebraic likelihood equivalence of Poisson regression with a log(time) offset and survival regression assuming piecewise constant hazards. This relationship allows us to synthesize two methods currently used to analyze sleep transition phenomena: stratified multi-state proportional hazards models and log-linear models with GEE for transition counts. An example data set from the Sleep Heart Health Study is analyzed.