970 results for Datasets
Abstract:
Genome-wide association studies (GWAS) have now identified at least 2,000 common variants that appear associated with common diseases or related traits (http://www.genome.gov/gwastudies), hundreds of which have been convincingly replicated. It is generally thought that the associated markers reflect the effect of a nearby common (minor allele frequency >0.05) causal site, which is associated with the marker, leading to extensive resequencing efforts to find causal sites. We propose as an alternative explanation that variants much less common than the associated one may create "synthetic associations" by occurring, stochastically, more often in association with one of the alleles at the common site versus the other allele. Although synthetic associations are an obvious theoretical possibility, they have never been systematically explored as a possible explanation for GWAS findings. Here, we use simple computer simulations to show the conditions under which such synthetic associations will arise and how they may be recognized. We show that they are not only possible, but inevitable, and that under simple but reasonable genetic models, they are likely to account for or contribute to many of the recently identified signals reported in genome-wide association studies. We also illustrate the behavior of synthetic associations in real datasets by showing that rare causal mutations responsible for both hearing loss and sickle cell anemia create genome-wide significant synthetic associations, in the latter case extending over a 2.5-Mb interval encompassing scores of "blocks" of associated variants. In conclusion, uncommon or rare genetic variants can easily create synthetic associations that are credited to common variants, and this possibility requires careful consideration in the interpretation and follow-up of GWAS signals.
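The simulation idea can be illustrated in a few lines. The sketch below is not the authors' code; it simply shows, under arbitrary toy parameters, how a rare causal variant that happens to sit mostly on one allelic background of a common marker can make the non-causal common marker itself test as associated.

```python
# Minimal, illustrative sketch (not the authors' simulation code): a rare causal
# variant that arises almost exclusively on one allelic background of a common
# marker can make the common marker itself appear associated ("synthetic
# association"). All parameter values below are arbitrary choices.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
n = 20_000                                   # haplotypes (haploid model for simplicity)

# Common marker: allele 1 at frequency 0.40 (not causal itself).
common = rng.binomial(1, 0.40, size=n)

# Rare causal variant (frequency ~0.01) placed only on the common-allele-1
# background, mimicking the shared ancestry of a recent rare mutation.
rare = np.zeros(n, dtype=int)
carriers = rng.choice(np.flatnonzero(common == 1), size=int(0.01 * n), replace=False)
rare[carriers] = 1

# Disease risk depends only on the rare variant.
risk = np.where(rare == 1, 0.50, 0.05)
disease = rng.binomial(1, risk)

# Case-control test at the COMMON marker only: the rare variant alone tends to
# drive an association signal at this non-causal site.
table = np.array([[np.sum((common == a) & (disease == d)) for d in (0, 1)] for a in (0, 1)])
chi2, p, _, _ = chi2_contingency(table)
print(f"association P-value at the non-causal common marker: {p:.3g}")
```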
Abstract:
Tumor microenvironmental stresses, such as hypoxia and lactic acidosis, play important roles in tumor progression. Although gene signatures reflecting the influence of these stresses are powerful approaches to link expression with phenotypes, they do not fully reflect the complexity of human cancers. Here, we describe the use of latent factor models to further dissect the stress gene signatures in a breast cancer expression dataset. The genes in these latent factors are coordinately expressed in tumors and depict distinct, interacting components of the biological processes. The genes in several latent factors are highly enriched in chromosomal locations. When these factors are analyzed in independent datasets with gene expression and array CGH data, the expression values of these factors are highly correlated with copy number alterations (CNAs) of the corresponding BAC clones in both the cell lines and tumors. Therefore, variation in the expression of these pathway-associated factors is at least partially caused by variation in gene dosage and CNAs among breast cancers. We have also found that the expression of two latent factors without any chromosomal enrichment is highly associated with 12q CNA, likely an instance of "trans"-variations in which CNA leads to variations in gene expression outside of the CNA region. In addition, we have found that factor 26 (1q CNA) is negatively correlated with HIF-1alpha protein and hypoxia pathways in breast tumors and cell lines. This agrees with, and for the first time links, the known good prognosis associated with both a low hypoxia signature and the presence of CNA in this region. Taken together, these results suggest the possibility that tumor segmental aneuploidy makes significant contributions to variation in the lactic acidosis/hypoxia gene signatures in human cancers and demonstrate that latent factor analysis is a powerful means to uncover such a linkage.
Abstract:
While genome-wide gene expression data are generated at an increasing rate, the repertoire of approaches for pattern discovery in these data is still limited. Identifying subtle patterns of interest in large amounts of data (tens of thousands of profiles) associated with a certain level of noise remains a challenge. A microarray time series was recently generated to study the transcriptional program of the mouse segmentation clock, a biological oscillator associated with the periodic formation of the segments of the body axis. A method related to Fourier analysis, the Lomb-Scargle periodogram, was used to detect periodic profiles in the dataset, leading to the identification of a novel set of cyclic genes associated with the segmentation clock. Here, we applied four distinct mathematical methods to the same microarray time series dataset to identify significant patterns in gene expression profiles. These methods, termed Phase consistency, Address reduction, Cyclohedron test, and Stable persistence, are based on different conceptual frameworks that are either hypothesis- or data-driven. Some of the methods, unlike Fourier transforms, do not depend on the assumption that the pattern of interest is periodic. Remarkably, these methods blindly identified the expression profiles of known cyclic genes as the most significant patterns in the dataset. Many candidate genes predicted by more than one approach appeared to be true positive cyclic genes and will be of particular interest for future research. In addition, these methods predicted novel candidate cyclic genes that were consistent with previous biological knowledge and experimental validation in mouse embryos. Our results demonstrate the utility of these novel pattern detection strategies, notably for detection of periodic profiles, and suggest that combining several distinct mathematical approaches to analyze microarray datasets is a valuable strategy for identifying genes that exhibit novel, interesting transcriptional patterns.
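As a point of reference for the periodogram step mentioned above, here is a minimal sketch (on synthetic data, not the actual segmentation-clock time series) of how a Lomb-Scargle periodogram separates a cyclic expression profile from a flat one; the sampling times and the period are arbitrary assumptions.

```python
# Minimal sketch of periodic-profile detection with a Lomb-Scargle periodogram,
# using synthetic data in place of the real microarray time series.
import numpy as np
from scipy.signal import lombscargle

rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0, 6, size=30))           # unevenly spaced sampling times (hours)
period = 2.0                                       # assumed oscillation period of the toy "cyclic gene"
cyclic = np.sin(2 * np.pi * t / period) + 0.3 * rng.normal(size=t.size)
flat = 0.3 * rng.normal(size=t.size)               # non-cyclic control profile

# Angular frequencies to scan; signals are mean-centered before the periodogram.
freqs = np.linspace(0.1, 10.0, 1000)
p_cyclic = lombscargle(t, cyclic - cyclic.mean(), freqs)
p_flat = lombscargle(t, flat - flat.mean(), freqs)

best = freqs[np.argmax(p_cyclic)]
print(f"peak period recovered for the cyclic profile: {2 * np.pi / best:.2f} h")
print(f"peak power, cyclic vs. flat profile: {p_cyclic.max():.2f} vs. {p_flat.max():.2f}")
```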
Abstract:
BACKGROUND: Dropouts and missing data are nearly ubiquitous in obesity randomized controlled trials, threatening the validity and generalizability of conclusions. Herein, we meta-analytically evaluate the extent of missing data, the frequency with which various analytic methods are employed to accommodate dropouts, and the performance of multiple statistical methods. METHODOLOGY/PRINCIPAL FINDINGS: We searched PubMed and Cochrane databases (2000-2006) for articles published in English and manually searched bibliographic references. Articles of pharmaceutical randomized controlled trials with weight loss or weight gain prevention as major endpoints were included. Two authors independently reviewed each publication for inclusion. 121 articles met the inclusion criteria. Two authors independently extracted treatment, sample size, dropout rates, study duration, and the statistical method used to handle missing data from all articles and resolved disagreements by consensus. In the meta-analysis, dropout rates were substantial, with survival (non-dropout) rates approximated by an exponential decay curve, e^(-λt), where λ was estimated to be 0.0088 (95% bootstrap confidence interval: 0.0076 to 0.0100) and t represents time in weeks. The estimated dropout rate at 1 year was 37%. Most studies used last observation carried forward as the primary analytic method to handle missing data. We also obtained 12 raw obesity randomized controlled trial datasets for empirical analyses. Analyses of raw randomized controlled trial data suggested that both mixed models and multiple imputation performed well, but that multiple imputation may be more robust when missing data are extensive. CONCLUSION/SIGNIFICANCE: Our analysis offers an equation for predicting dropout rates that is useful for future study planning. Our raw data analyses suggest that multiple imputation is better than other methods for handling missing data in obesity randomized controlled trials, followed closely by mixed models. We suggest these methods supplant last observation carried forward as the primary method of analysis.
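The figures quoted above are internally consistent, which a short check makes explicit; the sketch below uses only the λ point estimate and confidence interval reported in the abstract.

```python
# Quick consistency check of the dropout model quoted above: retention follows
# exp(-lambda * t) with lambda = 0.0088 per week, so dropout at 52 weeks should
# be roughly the reported 37%.
import math

lam = 0.0088                        # per week (point estimate from the abstract)
ci = (0.0076, 0.0100)               # 95% bootstrap CI for lambda
t = 52                              # weeks in one year

print(f"predicted 1-year dropout: {1 - math.exp(-lam * t):.0%}")        # ~37%
print(f"range implied by the CI on lambda: "
      f"{1 - math.exp(-ci[0] * t):.0%} to {1 - math.exp(-ci[1] * t):.0%}")
```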
Abstract:
New burned area datasets and top-down constraints from atmospheric concentration measurements of pyrogenic gases have decreased the large uncertainty in fire emissions estimates. However, significant gaps remain in our understanding of the contribution of deforestation, savanna, forest, agricultural waste, and peat fires to total global fire emissions. Here we used a revised version of the Carnegie-Ames-Stanford-Approach (CASA) biogeochemical model and improved satellite-derived estimates of area burned, fire activity, and plant productivity to calculate fire emissions for the 1997-2009 period at a 0.5° spatial resolution with a monthly time step. For November 2000 onwards, estimates were based on burned area, active fire detections, and plant productivity from the MODerate resolution Imaging Spectroradiometer (MODIS) sensor. For the partitioning we focused on the MODIS era. We used maps of burned area derived from the Tropical Rainfall Measuring Mission (TRMM) Visible and Infrared Scanner (VIRS) and Along-Track Scanning Radiometer (ATSR) active fire data prior to MODIS (1997-2000) and estimates of plant productivity derived from Advanced Very High Resolution Radiometer (AVHRR) observations during the same period. Average global fire carbon emissions according to this version 3 of the Global Fire Emissions Database (GFED3) were 2.0 Pg C year-1 with significant interannual variability during 1997-2001 (2.8 Pg C year-1 in 1998 and 1.6 Pg C year-1 in 2001). Globally, emissions during 2002-2007 were relatively constant (around 2.1 Pg C year-1) before declining in 2008 (1.7 Pg C year-1) and 2009 (1.5 Pg C year-1), partly due to lower deforestation fire emissions in South America and tropical Asia. On a regional basis, emissions were highly variable during 2002-2007 (e.g., boreal Asia, South America, and Indonesia), but these regional differences canceled out at a global level. During the MODIS era (2001-2009), most carbon emissions were from fires in grasslands and savannas (44%), with smaller contributions from tropical deforestation and degradation fires (20%), woodland fires (mostly confined to the tropics, 16%), forest fires (mostly in the extratropics, 15%), agricultural waste burning (3%), and tropical peat fires (3%). The contribution from agricultural waste fires was likely a lower bound because our approach for measuring burned area could not detect all of these relatively small fires. Total carbon emissions were on average 13% lower than in our previous (GFED2) work. For reduced trace gases such as CO and CH4, deforestation, degradation, and peat fires were more important contributors because of higher emissions of reduced trace gases per unit carbon combusted compared to savanna fires. Carbon emissions from tropical deforestation, degradation, and peatland fires were on average 0.5 Pg C year-1. The carbon emissions from these fires may not be balanced by regrowth following fire. Our results provide the first global assessment of the contribution of different sources to total global fire emissions for the past decade, and supply the community with an improved 13-year fire emissions time series.
Abstract:
A visually apparent but scientifically untested outcome of land-use change is homogenization across urban areas, where neighborhoods in different parts of the country have similar patterns of roads, residential lots, commercial areas, and aquatic features. We hypothesize that this homogenization extends to ecological structure and also to ecosystem functions such as carbon dynamics and microclimate, with continental-scale implications. Further, we suggest that understanding urban homogenization will provide the basis for understanding the impacts of urban land-use change from local to continental scales. Here, we show how multi-scale, multidisciplinary datasets from six metropolitan areas that cover the major climatic regions of the US (Phoenix, AZ; Miami, FL; Baltimore, MD; Boston, MA; Minneapolis-St Paul, MN; and Los Angeles, CA) can be used to determine how household and neighborhood characteristics correlate with land-management practices, land-cover composition, and landscape structure and ecosystem functions at local, regional, and continental scales.
Abstract:
Learning multiple tasks across heterogeneous domains is a challenging problem since the feature space may not be the same for different tasks. We assume the data in multiple tasks are generated from a latent common domain via sparse domain transforms and propose a latent probit model (LPM) to jointly learn the domain transforms and the shared probit classifier in the common domain. To learn meaningful task relatedness and avoid over-fitting in classification, we introduce sparsity in the domain transform matrices as well as in the common classifier. We derive theoretical bounds for the estimation error of the classifier in terms of the sparsity of the domain transforms. An expectation-maximization algorithm is derived for learning the LPM. The effectiveness of the approach is demonstrated on several real datasets.
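To make the model structure concrete, here is a toy generative sketch (an illustration under assumed dimensions and sparsity levels, not the paper's EM estimation code): each task observes its own features produced from a shared latent domain through a sparse transform, and a single probit classifier labels the latent representation.

```python
# Toy sketch of the generative structure described above (illustration only):
# each task observes x = A_t @ z, where z lives in a shared latent domain,
# A_t is a sparse task-specific transform, and one shared probit classifier
# labels z. Dimensions and sparsity level are assumptions for illustration.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
d_latent = 10
dims = {"task_A": 25, "task_B": 40}            # heterogeneous feature spaces
n_per_task = 200

w = rng.normal(size=d_latent)                  # shared classifier weights (probit link)

data = {}
for task, d_obs in dims.items():
    # Sparse domain transform: most entries are zero.
    A = rng.normal(size=(d_obs, d_latent)) * (rng.random((d_obs, d_latent)) < 0.15)
    z = rng.normal(size=(n_per_task, d_latent))           # shared-domain representation
    x = z @ A.T + 0.1 * rng.normal(size=(n_per_task, d_obs))
    y = rng.binomial(1, norm.cdf(z @ w))                  # probit-link labels
    data[task] = (x, y)
    print(task, "observed features:", d_obs, "positive rate:", round(y.mean(), 2))
```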
Abstract:
We study the problem of supervised linear dimensionality reduction, taking an information-theoretic viewpoint. The linear projection matrix is designed by maximizing the mutual information between the projected signal and the class label. By harnessing a recent theoretical result on the gradient of mutual information, the above optimization problem can be solved directly using gradient descent, without requiring simplification of the objective function. Theoretical analysis and empirical comparison are made between the proposed method and two closely related methods, and comparisons are also made with a method in which Rényi entropy is used to define the mutual information (in this case the gradient may be computed simply, under a special parameter setting). Relative to these alternative approaches, the proposed method achieves promising results on real datasets.
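The optimization target can be illustrated with a crude stand-in for the paper's approach: the sketch below tunes a one-dimensional projection by finite-difference ascent on a k-NN plug-in estimate of mutual information (sklearn's mutual_info_classif), rather than the exact mutual-information gradient the paper relies on; the dataset, step sizes, and iteration count are arbitrary choices.

```python
# Crude sketch of MI-driven supervised projection (not the paper's method, which
# uses an analytical expression for the gradient of mutual information): a 1-D
# projection w is tuned by finite-difference ascent on a k-NN MI estimate.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)

def mi_of_projection(w):
    # Estimated mutual information between the 1-D projection X @ w and the label.
    z = (X @ w).reshape(-1, 1)
    return mutual_info_classif(z, y, random_state=0)[0]

rng = np.random.default_rng(3)
w = rng.normal(size=X.shape[1])
w /= np.linalg.norm(w)

eps, lr = 1e-2, 0.2
for _ in range(30):
    grad = np.array([
        (mi_of_projection(w + eps * e) - mi_of_projection(w - eps * e)) / (2 * eps)
        for e in np.eye(X.shape[1])
    ])
    w = w + lr * grad
    w /= np.linalg.norm(w)                  # keep the projection direction unit-norm

print("estimated MI of the learned 1-D projection:", round(mi_of_projection(w), 3))
```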
Abstract:
BACKGROUND: The evolutionary relationships of modern birds are among the most challenging to understand in systematic biology and have been debated for centuries. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders, and used the genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomics analyses (Jarvis et al. in press; Zhang et al. in press). Here we release assemblies and datasets associated with the comparative genome analyses, which include 38 newly sequenced avian genomes plus previously released or simultaneously released genomes of Chicken, Zebra finch, Turkey, Pigeon, Peregrine falcon, Duck, Budgerigar, Adelie penguin, Emperor penguin and the Medium Ground Finch. We hope that this resource will serve future efforts in phylogenomics and comparative genomics. FINDINGS: The 38 bird genomes were sequenced using the Illumina HiSeq 2000 platform and assembled using a whole genome shotgun strategy. The 48 genomes were categorized into two groups according to the N50 scaffold size of the assemblies: a high depth group comprising 23 species sequenced at high coverage (>50X) with multiple insert size libraries, resulting in N50 scaffold sizes greater than 1 Mb (except the White-throated Tinamou and Bald Eagle); and a low depth group comprising 25 species sequenced at low coverage (~30X) with two insert size libraries, resulting in an average N50 scaffold size of about 50 kb. Repetitive elements comprised 4%-22% of the bird genomes. The assembled scaffolds allowed the homology-based annotation of 13,000-17,000 protein coding genes in each avian genome relative to chicken, zebra finch and human, as well as comparative and sequence conservation analyses. CONCLUSIONS: Here we release full genome assemblies of 38 newly sequenced avian species, link genome assembly downloads for 7 of the remaining 10 species, and provide a guide to the genomic data that have been generated and used in our Avian Phylogenomics Project. To the best of our knowledge, the Avian Phylogenomics Project is the biggest vertebrate comparative genomics project to date. The genomic data presented here are expected to accelerate further analyses in many fields, including phylogenetics, comparative genomics, evolution, neurobiology, developmental biology, and other related areas.
Abstract:
BACKGROUND: Administrative or quality improvement registries may or may not contain the elements needed for investigations by trauma researchers. The International Classification of Diseases Program for Injury Categorisation (ICDPIC), a statistical program available through Stata, is a powerful tool that can extract injury severity scores from ICD-9-CM codes. We conducted a validation study for use of the ICDPIC in trauma research. METHODS: We conducted a retrospective cohort validation study of 40,418 patients with injury using a large regional trauma registry. ICDPIC-generated AIS scores for each body region were compared with trauma registry AIS scores (gold standard) in adult and paediatric populations. A separate analysis was conducted among patients with traumatic brain injury (TBI) comparing the ICDPIC tool with ICD-9-CM embedded severity codes. Performance in characterising overall injury severity, as measured by the ISS, was also assessed. RESULTS: The ICDPIC tool showed substantial agreement in thoracic and abdominal trauma (weighted κ 0.87-0.92) and in head and neck trauma (weighted κ 0.76-0.83). The ICDPIC tool captured TBI severity better than ICD-9-CM code embedded severity and offered the advantage of generating a severity value for every patient (rather than having missing data). Its ability to produce an accurate severity score was consistent within each body region as well as overall. CONCLUSIONS: The ICDPIC tool performs well in classifying injury severity and is superior to ICD-9-CM embedded severity for TBI. Use of ICDPIC demonstrates substantial efficiency and may be a preferred tool in determining injury severity for large trauma datasets, provided researchers understand its limitations and take caution when examining smaller trauma datasets.
Abstract:
In the United States, poverty has been historically higher and disproportionately concentrated in the American South. Despite this fact, much of the conventional poverty literature in the United States has focused on urban poverty in cities, particularly in the Northeast and Midwest. Relatively less American poverty research has focused on the enduring economic distress in the South, which Wimberley (2008:899) calls “a neglected regional crisis of historic and contemporary urgency.” Accordingly, this dissertation contributes to the inequality literature by focusing much needed attention on poverty in the South.
Each empirical chapter focuses on a different aspect of poverty in the South. Chapter 2 examines why poverty is higher in the South relative to the Non-South. Chapter 3 focuses on poverty predictors within the South and whether there are differences in the sub-regions of the Deep South and Peripheral South. These two chapters compare the roles of family demography, economic structure, racial/ethnic composition and heterogeneity, and power resources in shaping poverty. Chapter 4 examines whether poverty in the South has been shaped by historical racial regimes.
The Luxembourg Income Study (LIS) United States datasets (2000, 2004, 2007, 2010, and 2013), derived from the U.S. Census Current Population Survey (CPS) Annual Social and Economic Supplement, provide all the individual-level data for this study. The LIS sample of 745,135 individuals is nested in rich economic, political, and racial state-level data compiled from multiple sources (e.g., the U.S. Census Bureau, the U.S. Department of Agriculture, and the University of Kentucky Center for Poverty Research). Analyses involve a combination of techniques, including linear probability regression models to predict poverty and binary decomposition of poverty differences; a schematic sketch of both techniques follows this abstract.
Chapter 2 results suggest that power resources, followed by economic structure, are most important in explaining the higher poverty in the South. This underscores the salience of political and economic contexts in shaping poverty across place. Chapter 3 results indicate that individual-level economic factors are the largest predictors of poverty within the South, and even more so in the Deep South. Moreover, divergent results between the South, Deep South, and Peripheral South illustrate how the impact of poverty predictors can vary in different contexts. Chapter 4 results show significant bivariate associations between historical race regimes and poverty among Southern states, although regression models fail to yield significant effects. Conversely, historical race regimes do have a small, but significant effect in explaining the Black-White poverty gap. Results also suggest that employment and education are key to understanding poverty among Blacks and the Black-White poverty gap. Collectively, these chapters underscore why place is so important for understanding poverty and inequality. They also illustrate the salience of micro and macro characteristics of place for helping create, maintain, and reproduce systems of inequality across place.
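The following sketch illustrates the two analytic techniques named in the methods paragraph above: a linear probability model fitted by OLS and a two-fold decomposition of a group gap in the Blinder-Oaxaca style, one common way to implement a binary decomposition (the dissertation's exact procedure may differ). It runs on synthetic stand-in data, since the LIS microdata are not reproduced here, and all variable names and coefficients are illustrative assumptions.

```python
# Schematic sketch: linear probability model for poverty plus a two-fold
# decomposition of the poverty gap between two groups. Synthetic data stand in
# for the restricted LIS microdata; every coefficient here is made up.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 5_000
group = rng.binomial(1, 0.3, size=n)                        # 1 = comparison group
educ = rng.normal(13 - group, 2, size=n)                    # years of schooling
employed = rng.binomial(1, 0.8 - 0.1 * group)               # employment indicator
p_poor = np.clip(0.45 - 0.02 * educ - 0.15 * employed + 0.05 * group, 0.01, 0.99)
poor = rng.binomial(1, p_poor)

X = sm.add_constant(np.column_stack([educ, employed]))

# Linear probability model within each group (OLS on a binary outcome).
beta = {}
for g in (0, 1):
    beta[g] = sm.OLS(poor[group == g], X[group == g]).fit().params

# Two-fold decomposition of the poverty gap, with group 0 coefficients as reference:
#   gap = (mean_X1 - mean_X0) @ beta0   (part explained by characteristics)
#       +  mean_X1 @ (beta1 - beta0)    (unexplained / "coefficients" part)
xbar0, xbar1 = X[group == 0].mean(axis=0), X[group == 1].mean(axis=0)
gap = poor[group == 1].mean() - poor[group == 0].mean()
explained = (xbar1 - xbar0) @ beta[0]
unexplained = xbar1 @ (beta[1] - beta[0])
print(f"poverty gap: {gap:.3f} = explained {explained:.3f} + unexplained {unexplained:.3f}")
```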
Abstract:
cERMIT is a computationally efficient motif discovery tool based on analyzing genome-wide quantitative regulatory evidence. Instead of pre-selecting promising candidate sequences, it utilizes information across all sequence regions to search for high-scoring motifs. We apply cERMIT to a range of direct binding and overexpression datasets; it substantially outperforms state-of-the-art approaches on curated ChIP-chip datasets and easily scales to current mammalian ChIP-seq experiments with data on thousands of non-coding regions.
Abstract:
Determination of copy number variants (CNVs) inferred from genome-wide single nucleotide polymorphism arrays has shown increasing utility in genetic variant disease associations. Several CNV detection methods are available, but differences in CNV call thresholds and characteristics exist. We evaluated the relative performance of seven methods: circular binary segmentation, CNVFinder, cnvPartition, gain and loss of DNA, Nexus algorithms, PennCNV and QuantiSNP. Tested data included real and simulated Illumina HumanHap550 data from the Singapore Cohort Study of the Risk Factors for Myopia (SCORM) and simulated data from Affymetrix 6.0 and platform-independent distributions. The normalized singleton ratio (NSR) is proposed as a metric for parameter optimization before running the full analysis. We used 10 SCORM samples to optimize parameter settings for each method and then evaluated method performance at the optimal parameters using 100 SCORM samples. Statistical power, false positive rates, and receiver operating characteristic (ROC) curve residuals were evaluated by simulation studies. Optimal parameters, as determined by NSR and ROC curve residuals, were consistent across datasets. QuantiSNP outperformed the other methods based on ROC curve residuals over most datasets. Nexus Rank and SNPRank have low specificity and high power. Nexus Rank calls oversized CNVs. PennCNV detects among the fewest CNVs.
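The evaluation metrics listed above (power, false positive rate, and ROC behavior) can be mocked up generically; the sketch below is not tied to any of the seven callers or to NSR, and uses fabricated scores purely to show how such a simulation-based comparison is computed.

```python
# Generic sketch of a simulation-style evaluation: given true CNV status and
# each method's hypothetical confidence scores, compute power, false positive
# rate, and ROC summaries. Scores and thresholds here are arbitrary.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(6)
n_regions = 2_000
truth = rng.binomial(1, 0.05, size=n_regions)             # 5% of regions carry a true CNV

# Fabricated confidence scores from two callers (stronger vs. weaker separation).
score_a = rng.normal(truth * 2.0, 1.0)
score_b = rng.normal(truth * 1.0, 1.0)

for name, score in [("method A", score_a), ("method B", score_b)]:
    calls = score > 1.0                                    # arbitrary call threshold
    power = calls[truth == 1].mean()                       # sensitivity among true CNVs
    fpr = calls[truth == 0].mean()                         # false positive rate
    auc = roc_auc_score(truth, score)
    print(f"{name}: power={power:.2f}  FPR={fpr:.2f}  AUC={auc:.2f}")

# Full ROC curve for method A (FPR vs. TPR across all thresholds).
fpr_curve, tpr_curve, _ = roc_curve(truth, score_a)
print(f"ROC curve for method A traced over {len(fpr_curve)} threshold points")
```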
Abstract:
Family dogs and dog owners offer a potentially powerful way to conduct citizen science to answer questions about animal behavior that are difficult to address with more conventional approaches. Here we evaluate the quality of the first data on dog cognition collected by citizen scientists using the Dognition.com website. We conducted analyses to understand whether data generated by over 500 citizen scientists replicate internally and in comparison to previously published findings. Half of the participants took part for free, while the other half paid for access. The website provided each participant with a temperament questionnaire and instructions on how to conduct a series of ten cognitive tests. Participation required internet access, a dog, and some common household items. Participants could record their responses on any PC, tablet or smartphone from anywhere in the world, and data were retained on servers. Results from citizen scientists and their dogs replicated a number of previously described phenomena from conventional lab-based research. There was little evidence that citizen scientists manipulated their results. To illustrate the potential uses of relatively large samples of citizen science data, we then used factor analysis to examine individual differences across the cognitive tasks. The data were best explained by multiple factors, in support of the hypothesis that nonhumans, including dogs, can evolve multiple cognitive domains that vary independently. This analysis suggests that in the future, citizen scientists will generate useful datasets that test hypotheses and answer questions as a complement to conventional laboratory techniques used to study dog psychology.
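As an illustration of the factor-analytic step described above, the sketch below runs on simulated task scores (not the actual Dognition data): two assumed latent cognitive domains generate ten task scores, and fitting factor analyses with increasing numbers of factors shows that a multi-factor solution describes the data better than a single factor.

```python
# Small sketch of the factor-analytic step, on simulated task scores standing in
# for the Dognition data: two latent "cognitive domains" generate ten task
# scores, and factor analysis with 1, 2, or 3 factors is compared by fit.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(5)
n_dogs, n_tasks = 500, 10
latent = rng.normal(size=(n_dogs, 2))                  # two independent cognitive domains
loadings = np.zeros((2, n_tasks))
loadings[0, :5] = rng.uniform(0.6, 1.0, 5)             # tasks 1-5 load on domain 1
loadings[1, 5:] = rng.uniform(0.6, 1.0, 5)             # tasks 6-10 load on domain 2
scores = latent @ loadings + 0.5 * rng.normal(size=(n_dogs, n_tasks))

# Compare 1-, 2-, and 3-factor models by average in-sample log-likelihood.
for k in (1, 2, 3):
    fa = FactorAnalysis(n_components=k, random_state=0).fit(scores)
    print(f"{k} factor(s): average log-likelihood = {fa.score(scores):.3f}")
```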
Abstract:
Despite an emerging understanding of the genetic alterations giving rise to various tumors, the mechanisms whereby most oncogenes are overexpressed remain unclear. Here we have utilized an integrated approach of genome-wide regulatory element mapping via DNase-seq, followed by conventional reporter assays and transcription factor binding site discovery, to characterize the transcriptional regulation of the medulloblastoma oncogene Orthodenticle Homeobox 2 (OTX2). Through these studies we have revealed that OTX2 is differentially regulated in medulloblastoma at the level of chromatin accessibility, which is in part mediated by DNA methylation. In cell lines exhibiting chromatin accessibility of OTX2 regulatory regions, we found that autoregulation maintains OTX2 expression. Comparison of medulloblastoma regulatory elements with those of the developing brain reveals that these tumors engage a developmental regulatory program to drive OTX2 transcription. Finally, we have identified a transcriptional regulatory element mediating retinoid-induced OTX2 repression in these tumors. This work characterizes for the first time the mechanisms of OTX2 overexpression in medulloblastoma. Furthermore, this study establishes proof of principle for applying ENCODE datasets towards the characterization of upstream trans-acting factors mediating expression of individual genes.