25 resultados para Datasets


Relevância:

10.00% 10.00%

Publicador:

Resumo:

BACKGROUND: Administrative or quality improvement registries may or may not contain the elements needed for investigations by trauma researchers. International Classification of Diseases Program for Injury Categorisation (ICDPIC), a statistical program available through Stata, is a powerful tool that can extract injury severity scores from ICD-9-CM codes. We conducted a validation study for use of the ICDPIC in trauma research. METHODS: We conducted a retrospective cohort validation study of 40,418 patients with injury using a large regional trauma registry. ICDPIC-generated AIS scores for each body region were compared with trauma registry AIS scores (gold standard) in adult and paediatric populations. A separate analysis was conducted among patients with traumatic brain injury (TBI) comparing the ICDPIC tool with ICD-9-CM embedded severity codes. Performance in characterising overall injury severity, by the ISS, was also assessed. RESULTS: The ICDPIC tool generated substantial correlations in thoracic and abdominal trauma (weighted κ 0.87-0.92), and in head and neck trauma (weighted κ 0.76-0.83). The ICDPIC tool captured TBI severity better than ICD-9-CM code embedded severity and offered the advantage of generating a severity value for every patient (rather than having missing data). Its ability to produce an accurate severity score was consistent within each body region as well as overall. CONCLUSIONS: The ICDPIC tool performs well in classifying injury severity and is superior to ICD-9-CM embedded severity for TBI. Use of ICDPIC demonstrates substantial efficiency and may be a preferred tool in determining injury severity for large trauma datasets, provided researchers understand its limitations and take caution when examining smaller trauma datasets.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

In the United States, poverty has been historically higher and disproportionately concentrated in the American South. Despite this fact, much of the conventional poverty literature in the United States has focused on urban poverty in cities, particularly in the Northeast and Midwest. Relatively less American poverty research has focused on the enduring economic distress in the South, which Wimberley (2008:899) calls “a neglected regional crisis of historic and contemporary urgency.” Accordingly, this dissertation contributes to the inequality literature by focusing much needed attention on poverty in the South.

Each empirical chapter focuses on a different aspect of poverty in the South. Chapter 2 examines why poverty is higher in the South relative to the Non-South. Chapter 3 focuses on poverty predictors within the South and whether there are differences in the sub-regions of the Deep South and Peripheral South. These two chapters compare the roles of family demography, economic structure, racial/ethnic composition and heterogeneity, and power resources in shaping poverty. Chapter 4 examines whether poverty in the South has been shaped by historical racial regimes.

The Luxembourg Income Study (LIS) United States datasets (2000, 2004, 2007, 2010, and 2013) (derived from the U.S. Census Current Population Survey (CPS) Annual Social and Economic Supplement) provide all the individual-level data for this study. The LIS sample of 745,135 individuals is nested in rich economic, political, and racial state-level data compiled from multiple sources (e.g. U.S. Census Bureau, U.S. Department of Agriculture, University of Kentucky Center for Poverty Research, etc.). Analyses involve a combination of techniques including linear probability regression models to predict poverty and binary decomposition of poverty differences.

Chapter 2 results suggest that power resources, followed by economic structure, are most important in explaining the higher poverty in the South. This underscores the salience of political and economic contexts in shaping poverty across place. Chapter 3 results indicate that individual-level economic factors are the largest predictors of poverty within the South, and even more so in the Deep South. Moreover, divergent results between the South, Deep South, and Peripheral South illustrate how the impact of poverty predictors can vary in different contexts. Chapter 4 results show significant bivariate associations between historical race regimes and poverty among Southern states, although regression models fail to yield significant effects. Conversely, historical race regimes do have a small, but significant effect in explaining the Black-White poverty gap. Results also suggest that employment and education are key to understanding poverty among Blacks and the Black-White poverty gap. Collectively, these chapters underscore why place is so important for understanding poverty and inequality. They also illustrate the salience of micro and macro characteristics of place for helping create, maintain, and reproduce systems of inequality across place.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

cERMIT is a computationally efficient motif discovery tool based on analyzing genome-wide quantitative regulatory evidence. Instead of pre-selecting promising candidate sequences, it utilizes information across all sequence regions to search for high-scoring motifs. We apply cERMIT on a range of direct binding and overexpression datasets; it substantially outperforms state-of-the-art approaches on curated ChIP-chip datasets, and easily scales to current mammalian ChIP-seq experiments with data on thousands of non-coding regions.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Determination of copy number variants (CNVs) inferred in genome wide single nucleotide polymorphism arrays has shown increasing utility in genetic variant disease associations. Several CNV detection methods are available, but differences in CNV call thresholds and characteristics exist. We evaluated the relative performance of seven methods: circular binary segmentation, CNVFinder, cnvPartition, gain and loss of DNA, Nexus algorithms, PennCNV and QuantiSNP. Tested data included real and simulated Illumina HumHap 550 data from the Singapore cohort study of the risk factors for Myopia (SCORM) and simulated data from Affymetrix 6.0 and platform-independent distributions. The normalized singleton ratio (NSR) is proposed as a metric for parameter optimization before enacting full analysis. We used 10 SCORM samples for optimizing parameter settings for each method and then evaluated method performance at optimal parameters using 100 SCORM samples. The statistical power, false positive rates, and receiver operating characteristic (ROC) curve residuals were evaluated by simulation studies. Optimal parameters, as determined by NSR and ROC curve residuals, were consistent across datasets. QuantiSNP outperformed other methods based on ROC curve residuals over most datasets. Nexus Rank and SNPRank have low specificity and high power. Nexus Rank calls oversized CNVs. PennCNV detects one of the fewest numbers of CNVs.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Family dogs and dog owners offer a potentially powerful way to conduct citizen science to answer questions about animal behavior that are difficult to answer with more conventional approaches. Here we evaluate the quality of the first data on dog cognition collected by citizen scientists using the Dognition.com website. We conducted analyses to understand if data generated by over 500 citizen scientists replicates internally and in comparison to previously published findings. Half of participants participated for free while the other half paid for access. The website provided each participant a temperament questionnaire and instructions on how to conduct a series of ten cognitive tests. Participation required internet access, a dog and some common household items. Participants could record their responses on any PC, tablet or smartphone from anywhere in the world and data were retained on servers. Results from citizen scientists and their dogs replicated a number of previously described phenomena from conventional lab-based research. There was little evidence that citizen scientists manipulated their results. To illustrate the potential uses of relatively large samples of citizen science data, we then used factor analysis to examine individual differences across the cognitive tasks. The data were best explained by multiple factors in support of the hypothesis that nonhumans, including dogs, can evolve multiple cognitive domains that vary independently. This analysis suggests that in the future, citizen scientists will generate useful datasets that test hypotheses and answer questions as a complement to conventional laboratory techniques used to study dog psychology.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Despite an emerging understanding of the genetic alterations giving rise to various tumors, the mechanisms whereby most oncogenes are overexpressed remain unclear. Here we have utilized an integrated approach of genomewide regulatory element mapping via DNase-seq followed by conventional reporter assays and transcription factor binding site discovery to characterize the transcriptional regulation of the medulloblastoma oncogene Orthodenticle Homeobox 2 (OTX2). Through these studies we have revealed that OTX2 is differentially regulated in medulloblastoma at the level of chromatin accessibility, which is in part mediated by DNA methylation. In cell lines exhibiting chromatin accessibility of OTX2 regulatory regions, we found that autoregulation maintains OTX2 expression. Comparison of medulloblastoma regulatory elements with those of the developing brain reveals that these tumors engage a developmental regulatory program to drive OTX2 transcription. Finally, we have identified a transcriptional regulatory element mediating retinoid-induced OTX2 repression in these tumors. This work characterizes for the first time the mechanisms of OTX2 overexpression in medulloblastoma. Furthermore, this study establishes proof of principle for applying ENCODE datasets towards the characterization of upstream trans-acting factors mediating expression of individual genes.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

UNLABELLED: • PREMISE OF THE STUDY: Understanding fern (monilophyte) phylogeny and its evolutionary timescale is critical for broad investigations of the evolution of land plants, and for providing the point of comparison necessary for studying the evolution of the fern sister group, seed plants. Molecular phylogenetic investigations have revolutionized our understanding of fern phylogeny, however, to date, these studies have relied almost exclusively on plastid data.• METHODS: Here we take a curated phylogenomics approach to infer the first broad fern phylogeny from multiple nuclear loci, by combining broad taxon sampling (73 ferns and 12 outgroup species) with focused character sampling (25 loci comprising 35877 bp), along with rigorous alignment, orthology inference and model selection.• KEY RESULTS: Our phylogeny corroborates some earlier inferences and provides novel insights; in particular, we find strong support for Equisetales as sister to the rest of ferns, Marattiales as sister to leptosporangiate ferns, and Dennstaedtiaceae as sister to the eupolypods. Our divergence-time analyses reveal that divergences among the extant fern orders all occurred prior to ∼200 MYA. Finally, our species-tree inferences are congruent with analyses of concatenated data, but generally with lower support. Those cases where species-tree support values are higher than expected involve relationships that have been supported by smaller plastid datasets, suggesting that deep coalescence may be reducing support from the concatenated nuclear data.• CONCLUSIONS: Our study demonstrates the utility of a curated phylogenomics approach to inferring fern phylogeny, and highlights the need to consider underlying data characteristics, along with data quantity, in phylogenetic studies.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

With increasing recognition of the roles RNA molecules and RNA/protein complexes play in an unexpected variety of biological processes, understanding of RNA structure-function relationships is of high current importance. To make clean biological interpretations from three-dimensional structures, it is imperative to have high-quality, accurate RNA crystal structures available, and the community has thoroughly embraced that goal. However, due to the many degrees of freedom inherent in RNA structure (especially for the backbone), it is a significant challenge to succeed in building accurate experimental models for RNA structures. This chapter describes the tools and techniques our research group and our collaborators have developed over the years to help RNA structural biologists both evaluate and achieve better accuracy. Expert analysis of large, high-resolution, quality-conscious RNA datasets provides the fundamental information that enables automated methods for robust and efficient error diagnosis in validating RNA structures at all resolutions. The even more crucial goal of correcting the diagnosed outliers has steadily developed toward highly effective, computationally based techniques. Automation enables solving complex issues in large RNA structures, but cannot circumvent the need for thoughtful examination of local details, and so we also provide some guidance for interpreting and acting on the results of current structure validation for RNA.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

BACKGROUND: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. FINDINGS: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. CONCLUSIONS: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.

Relevância:

10.00% 10.00%

Publicador:

Resumo:

Transcriptional regulation has been studied intensively in recent decades. One important aspect of this regulation is the interaction between regulatory proteins, such as transcription factors (TF) and nucleosomes, and the genome. Different high-throughput techniques have been invented to map these interactions genome-wide, including ChIP-based methods (ChIP-chip, ChIP-seq, etc.), nuclease digestion methods (DNase-seq, MNase-seq, etc.), and others. However, a single experimental technique often only provides partial and noisy information about the whole picture of protein-DNA interactions. Therefore, the overarching goal of this dissertation is to provide computational developments for jointly modeling different experimental datasets to achieve a holistic inference on the protein-DNA interaction landscape.

We first present a computational framework that can incorporate the protein binding information in MNase-seq data into a thermodynamic model of protein-DNA interaction. We use a correlation-based objective function to model the MNase-seq data and a Markov chain Monte Carlo method to maximize the function. Our results show that the inferred protein-DNA interaction landscape is concordant with the MNase-seq data and provides a mechanistic explanation for the experimentally collected MNase-seq fragments. Our framework is flexible and can easily incorporate other data sources. To demonstrate this flexibility, we use prior distributions to integrate experimentally measured protein concentrations.

We also study the ability of DNase-seq data to position nucleosomes. Traditionally, DNase-seq has only been widely used to identify DNase hypersensitive sites, which tend to be open chromatin regulatory regions devoid of nucleosomes. We reveal for the first time that DNase-seq datasets also contain substantial information about nucleosome translational positioning, and that existing DNase-seq data can be used to infer nucleosome positions with high accuracy. We develop a Bayes-factor-based nucleosome scoring method to position nucleosomes using DNase-seq data. Our approach utilizes several effective strategies to extract nucleosome positioning signals from the noisy DNase-seq data, including jointly modeling data points across the nucleosome body and explicitly modeling the quadratic and oscillatory DNase I digestion pattern on nucleosomes. We show that our DNase-seq-based nucleosome map is highly consistent with previous high-resolution maps. We also show that the oscillatory DNase I digestion pattern is useful in revealing the nucleosome rotational context around TF binding sites.

Finally, we present a state-space model (SSM) for jointly modeling different kinds of genomic data to provide an accurate view of the protein-DNA interaction landscape. We also provide an efficient expectation-maximization algorithm to learn model parameters from data. We first show in simulation studies that the SSM can effectively recover underlying true protein binding configurations. We then apply the SSM to model real genomic data (both DNase-seq and MNase-seq data). Through incrementally increasing the types of genomic data in the SSM, we show that different data types can contribute complementary information for the inference of protein binding landscape and that the most accurate inference comes from modeling all available datasets.

This dissertation provides a foundation for future research by taking a step toward the genome-wide inference of protein-DNA interaction landscape through data integration.