931 resultados para large data sets
Resumo:
General Summary Although the chapters of this thesis address a variety of issues, the principal aim is common: test economic ideas in an international economic context. The intention has been to supply empirical findings using the largest suitable data sets and making use of the most appropriate empirical techniques. This thesis can roughly be divided into two parts: the first one, corresponding to the first two chapters, investigates the link between trade and the environment, the second one, the last three chapters, is related to economic geography issues. Environmental problems are omnipresent in the daily press nowadays and one of the arguments put forward is that globalisation causes severe environmental problems through the reallocation of investments and production to countries with less stringent environmental regulations. A measure of the amplitude of this undesirable effect is provided in the first part. The third and the fourth chapters explore the productivity effects of agglomeration. The computed spillover effects between different sectors indicate how cluster-formation might be productivity enhancing. The last chapter is not about how to better understand the world but how to measure it and it was just a great pleasure to work on it. "The Economist" writes every week about the impressive population and economic growth observed in China and India, and everybody agrees that the world's center of gravity has shifted. But by how much and how fast did it shift? An answer is given in the last part, which proposes a global measure for the location of world production and allows to visualize our results in Google Earth. A short summary of each of the five chapters is provided below. The first chapter, entitled "Unraveling the World-Wide Pollution-Haven Effect" investigates the relative strength of the pollution haven effect (PH, comparative advantage in dirty products due to differences in environmental regulation) and the factor endowment effect (FE, comparative advantage in dirty, capital intensive products due to differences in endowments). We compute the pollution content of imports using the IPPS coefficients (for three pollutants, namely biological oxygen demand, sulphur dioxide and toxic pollution intensity for all manufacturing sectors) provided by the World Bank and use a gravity-type framework to isolate the two above mentioned effects. Our study covers 48 countries that can be classified into 29 Southern and 19 Northern countries and uses the lead content of gasoline as proxy for environmental stringency. For North-South trade we find significant PH and FE effects going in the expected, opposite directions and being of similar magnitude. However, when looking at world trade, the effects become very small because of the high North-North trade share, where we have no a priori expectations about the signs of these effects. Therefore popular fears about the trade effects of differences in environmental regulations might by exaggerated. The second chapter is entitled "Is trade bad for the Environment? Decomposing worldwide SO2 emissions, 1990-2000". First we construct a novel and large database containing reasonable estimates of SO2 emission intensities per unit labor that vary across countries, periods and manufacturing sectors. Then we use these original data (covering 31 developed and 31 developing countries) to decompose the worldwide SO2 emissions into the three well known dynamic effects (scale, technique and composition effect). We find that the positive scale (+9,5%) and the negative technique (-12.5%) effect are the main driving forces of emission changes. Composition effects between countries and sectors are smaller, both negative and of similar magnitude (-3.5% each). Given that trade matters via the composition effects this means that trade reduces total emissions. We next construct, in a first experiment, a hypothetical world where no trade happens, i.e. each country produces its imports at home and does no longer produce its exports. The difference between the actual and this no-trade world allows us (under the omission of price effects) to compute a static first-order trade effect. The latter now increases total world emissions because it allows, on average, dirty countries to specialize in dirty products. However, this effect is smaller (3.5%) in 2000 than in 1990 (10%), in line with the negative dynamic composition effect identified in the previous exercise. We then propose a second experiment, comparing effective emissions with the maximum or minimum possible level of SO2 emissions. These hypothetical levels of emissions are obtained by reallocating labour accordingly across sectors within each country (under the country-employment and the world industry-production constraints). Using linear programming techniques, we show that emissions are reduced by 90% with respect to the worst case, but that they could still be reduced further by another 80% if emissions were to be minimized. The findings from this chapter go together with those from chapter one in the sense that trade-induced composition effect do not seem to be the main source of pollution, at least in the recent past. Going now to the economic geography part of this thesis, the third chapter, entitled "A Dynamic Model with Sectoral Agglomeration Effects" consists of a short note that derives the theoretical model estimated in the fourth chapter. The derivation is directly based on the multi-regional framework by Ciccone (2002) but extends it in order to include sectoral disaggregation and a temporal dimension. This allows us formally to write present productivity as a function of past productivity and other contemporaneous and past control variables. The fourth chapter entitled "Sectoral Agglomeration Effects in a Panel of European Regions" takes the final equation derived in chapter three to the data. We investigate the empirical link between density and labour productivity based on regional data (245 NUTS-2 regions over the period 1980-2003). Using dynamic panel techniques allows us to control for the possible endogeneity of density and for region specific effects. We find a positive long run elasticity of density with respect to labour productivity of about 13%. When using data at the sectoral level it seems that positive cross-sector and negative own-sector externalities are present in manufacturing while financial services display strong positive own-sector effects. The fifth and last chapter entitled "Is the World's Economic Center of Gravity Already in Asia?" computes the world economic, demographic and geographic center of gravity for 1975-2004 and compares them. Based on data for the largest cities in the world and using the physical concept of center of mass, we find that the world's economic center of gravity is still located in Europe, even though there is a clear shift towards Asia. To sum up, this thesis makes three main contributions. First, it provides new estimates of orders of magnitudes for the role of trade in the globalisation and environment debate. Second, it computes reliable and disaggregated elasticities for the effect of density on labour productivity in European regions. Third, it allows us, in a geometrically rigorous way, to track the path of the world's economic center of gravity.
Resumo:
The class of Schoenberg transformations, embedding Euclidean distances into higher dimensional Euclidean spaces, is presented, and derived from theorems on positive definite and conditionally negative definite matrices. Original results on the arc lengths, angles and curvature of the transformations are proposed, and visualized on artificial data sets by classical multidimensional scaling. A distance-based discriminant algorithm and a robust multidimensional centroid estimate illustrate the theory, closely connected to the Gaussian kernels of Machine Learning.
Resumo:
DnaSP is a software package for the analysis of DNA polymorphism data. Present version introduces several new modules and features which, among other options allow: (1) handling big data sets (~5 Mb per sequence); (2) conducting a large number of coalescent-based tests by Monte Carlo computer simulations; (3) extensive analyses of the genetic differentiation and gene flow among populations; (4) analysing the evolutionary pattern of preferred and unpreferred codons; (5) generating graphical outputs for an easy visualization of results. Availability: The software package, including complete documentation and examples, is freely available to academic users from: http://www.ub.es/dnasp
Resumo:
We conduct a large-scale comparative study on linearly combining superparent-one-dependence estimators (SPODEs), a popular family of seminaive Bayesian classifiers. Altogether, 16 model selection and weighing schemes, 58 benchmark data sets, and various statistical tests are employed. This paper's main contributions are threefold. First, it formally presents each scheme's definition, rationale, and time complexity and hence can serve as a comprehensive reference for researchers interested in ensemble learning. Second, it offers bias-variance analysis for each scheme's classification error performance. Third, it identifies effective schemes that meet various needs in practice. This leads to accurate and fast classification algorithms which have an immediate and significant impact on real-world applications. Another important feature of our study is using a variety of statistical tests to evaluate multiple learning methods across multiple data sets.
Resumo:
One major methodological problem in analysis of sequence data is the determination of costs from which distances between sequences are derived. Although this problem is currently not optimally dealt with in the social sciences, it has some similarity with problems that have been solved in bioinformatics for three decades. In this article, the authors propose an optimization of substitution and deletion/insertion costs based on computational methods. The authors provide an empirical way of determining costs for cases, frequent in the social sciences, in which theory does not clearly promote one cost scheme over another. Using three distinct data sets, the authors tested the distances and cluster solutions produced by the new cost scheme in comparison with solutions based on cost schemes associated with other research strategies. The proposed method performs well compared with other cost-setting strategies, while it alleviates the justification problem of cost schemes.
Resumo:
The temporal dynamics of species diversity are shaped by variations in the rates of speciation and extinction, and there is a long history of inferring these rates using first and last appearances of taxa in the fossil record. Understanding diversity dynamics critically depends on unbiased estimates of the unobserved times of speciation and extinction for all lineages, but the inference of these parameters is challenging due to the complex nature of the available data. Here, we present a new probabilistic framework to jointly estimate species-specific times of speciation and extinction and the rates of the underlying birth-death process based on the fossil record. The rates are allowed to vary through time independently of each other, and the probability of preservation and sampling is explicitly incorporated in the model to estimate the true lifespan of each lineage. We implement a Bayesian algorithm to assess the presence of rate shifts by exploring alternative diversification models. Tests on a range of simulated data sets reveal the accuracy and robustness of our approach against violations of the underlying assumptions and various degrees of data incompleteness. Finally, we demonstrate the application of our method with the diversification of the mammal family Rhinocerotidae and reveal a complex history of repeated and independent temporal shifts of both speciation and extinction rates, leading to the expansion and subsequent decline of the group. The estimated parameters of the birth-death process implemented here are directly comparable with those obtained from dated molecular phylogenies. Thus, our model represents a step towards integrating phylogenetic and fossil information to infer macroevolutionary processes.
Resumo:
Predictive groundwater modeling requires accurate information about aquifer characteristics. Geophysical imaging is a powerful tool for delineating aquifer properties at an appropriate scale and resolution, but it suffers from problems of ambiguity. One way to overcome such limitations is to adopt a simultaneous multitechnique inversion strategy. We have developed a methodology for aquifer characterization based on structural joint inversion of multiple geophysical data sets followed by clustering to form zones and subsequent inversion for zonal parameters. Joint inversions based on cross-gradient structural constraints require less restrictive assumptions than, say, applying predefined petro-physical relationships and generally yield superior results. This approach has, for the first time, been applied to three geophysical data types in three dimensions. A classification scheme using maximum likelihood estimation is used to determine the parameters of a Gaussian mixture model that defines zonal geometries from joint-inversion tomograms. The resulting zones are used to estimate representative geophysical parameters of each zone, which are then used for field-scale petrophysical analysis. A synthetic study demonstrated how joint inversion of seismic and radar traveltimes and electrical resistance tomography (ERT) data greatly reduces misclassification of zones (down from 21.3% to 3.7%) and improves the accuracy of retrieved zonal parameters (from 1.8% to 0.3%) compared to individual inversions. We applied our scheme to a data set collected in northeastern Switzerland to delineate lithologic subunits within a gravel aquifer. The inversion models resolve three principal subhorizontal units along with some important 3D heterogeneity. Petro-physical analysis of the zonal parameters indicated approximately 30% variation in porosity within the gravel aquifer and an increasing fraction of finer sediments with depth.
Resumo:
Selectome (http://selectome.unil.ch/) is a database of positive selection, based on a branch-site likelihood test. This model estimates the number of nonsynonymous substitutions (dN) and synonymous substitutions (dS) to evaluate the variation in selective pressure (dN/dS ratio) over branches and over sites. Since the original release of Selectome, we have benchmarked and implemented a thorough quality control procedure on multiple sequence alignments, aiming to provide minimum false-positive results. We have also improved the computational efficiency of the branch-site test implementation, allowing larger data sets and more frequent updates. Release 6 of Selectome includes all gene trees from Ensembl for Primates and Glires, as well as a large set of vertebrate gene trees. A total of 6810 gene trees have some evidence of positive selection. Finally, the web interface has been improved to be more responsive and to facilitate searches and browsing.
Resumo:
The quantification of gene expression at the single cell level uncovers novel regulatory mechanisms obscured in measurements performed at the population level. Two methods based on microscopy and flow cytometry are presented to demonstrate how such data can be acquired. The expression of a fluorescent reporter induced upon activation of the high osmolarity glycerol MAPK pathway in yeast is used as an example. The specific advantages of each method are highlighted. Flow cytometry measures a large number of cells (10,000) and provides a direct measure of the dynamics of protein expression independent of the slow maturation kinetics of the fluorescent protein. Imaging of living cells by microscopy is by contrast limited to the measurement of the matured form of the reporter in fewer cells. However, the data sets generated by this technique can be extremely rich thanks to the combinations of multiple reporters and to the spatial and temporal information obtained from individual cells. The combination of these two measurement methods can deliver new insights on the regulation of protein expression by signaling pathways.
Resumo:
Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets. To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state of the art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments. GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA contributes to the current need of GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.
Resumo:
Within the ENCODE Consortium, GENCODE aimed to accurately annotate all protein-coding genes, pseudogenes, and noncoding transcribed loci in the human genome through manual curation and computational methods. Annotated transcript structures were assessed, and less well-supported loci were systematically, experimentally validated. Predicted exon-exon junctions were evaluated by RT-PCR amplification followed by highly multiplexed sequencing readout, a method we called RT-PCR-seq. Seventy-nine percent of all assessed junctions are confirmed by this evaluation procedure, demonstrating the high quality of the GENCODE gene set. RT-PCR-seq was also efficient to screen gene models predicted using the Human Body Map (HBM) RNA-seq data. We validated 73% of these predictions, thus confirming 1168 novel genes, mostly noncoding, which will further complement the GENCODE annotation. Our novel experimental validation pipeline is extremely sensitive, far more than unbiased transcriptome profiling through RNA sequencing, which is becoming the norm. For example, exon-exon junctions unique to GENCODE annotated transcripts are five times more likely to be corroborated with our targeted approach than with extensive large human transcriptome profiling. Data sets such as the HBM and ENCODE RNA-seq data fail sampling of low-expressed transcripts. Our RT-PCR-seq targeted approach also has the advantage of identifying novel exons of known genes, as we discovered unannotated exons in ~11% of assessed introns. We thus estimate that at least 18% of known loci have yet-unannotated exons. Our work demonstrates that the cataloging of all of the genic elements encoded in the human genome will necessitate a coordinated effort between unbiased and targeted approaches, like RNA-seq and RT-PCR-seq.
Resumo:
US Geological Survey (USGS) based elevation data are the most commonly used data source for highway hydraulic analysis; however, due to the vertical accuracy of USGS-based elevation data, USGS data may be too “coarse” to adequately describe surface profiles of watershed areas or drainage patterns. Additionally hydraulic design requires delineation of much smaller drainage areas (watersheds) than other hydrologic applications, such as environmental, ecological, and water resource management. This research study investigated whether higher resolution LIDAR based surface models would provide better delineation of watersheds and drainage patterns as compared to surface models created from standard USGS-based elevation data. Differences in runoff values were the metric used to compare the data sets. The two data sets were compared for a pilot study area along the Iowa 1 corridor between Iowa City and Mount Vernon. Given the limited breadth of the analysis corridor, areas of particular emphasis were the location of drainage area boundaries and flow patterns parallel to and intersecting the road cross section. Traditional highway hydrology does not appear to be significantly impacted, or benefited, by the increased terrain detail that LIDAR provided for the study area. In fact, hydrologic outputs, such as streams and watersheds, may be too sensitive to the increased horizontal resolution and/or errors in the data set. However, a true comparison of LIDAR and USGS-based data sets of equal size and encompassing entire drainage areas could not be performed in this study. Differences may also result in areas with much steeper slopes or significant changes in terrain. LIDAR may provide possibly valuable detail in areas of modified terrain, such as roads. Better representations of channel and terrain detail in the vicinity of the roadway may be useful in modeling problem drainage areas and evaluating structural surety during and after significant storm events. Furthermore, LIDAR may be used to verify the intended/expected drainage patterns at newly constructed highways. LIDAR will likely provide the greatest benefit for highway projects in flood plains and areas with relatively flat terrain where slight changes in terrain may have a significant impact on drainage patterns.
Resumo:
Vagueness and high dimensional space data are usual features of current data. The paper is an approach to identify conceptual structures among fuzzy three dimensional data sets in order to get conceptual hierarchy. We propose a fuzzy extension of the Galois connections that allows to demonstrate an isomorphism theorem between fuzzy sets closures which is the basis for generating lattices ordered-sets
Resumo:
Genotypic frequencies at codominant marker loci in population samples convey information on mating systems. A classical way to extract this information is to measure heterozygote deficiencies (FIS) and obtain the selfing rate s from FIS = s/(2 - s), assuming inbreeding equilibrium. A major drawback is that heterozygote deficiencies are often present without selfing, owing largely to technical artefacts such as null alleles or partial dominance. We show here that, in the absence of gametic disequilibrium, the multilocus structure can be used to derive estimates of s independent of FIS and free of technical biases. Their statistical power and precision are comparable to those of FIS, although they are sensitive to certain types of gametic disequilibria, a bias shared with progeny-array methods but not FIS. We analyse four real data sets spanning a range of mating systems. In two examples, we obtain s = 0 despite positive FIS, strongly suggesting that the latter are artefactual. In the remaining examples, all estimates are consistent. All the computations have been implemented in a open-access and user-friendly software called rmes (robust multilocus estimate of selfing) available at http://ftp.cefe.cnrs.fr, and can be used on any multilocus data. Being able to extract the reliable information from imperfect data, our method opens the way to make use of the ever-growing number of published population genetic studies, in addition to the more demanding progeny-array approaches, to investigate selfing rates.
Resumo:
Anorexia nervosa (AN) is a complex and heritable eating disorder characterized by dangerously low body weight. Neither candidate gene studies nor an initial genome-wide association study (GWAS) have yielded significant and replicated results. We performed a GWAS in 2907 cases with AN from 14 countries (15 sites) and 14 860 ancestrally matched controls as part of the Genetic Consortium for AN (GCAN) and the Wellcome Trust Case Control Consortium 3 (WTCCC3). Individual association analyses were conducted in each stratum and meta-analyzed across all 15 discovery data sets. Seventy-six (72 independent) single nucleotide polymorphisms were taken forward for in silico (two data sets) or de novo (13 data sets) replication genotyping in 2677 independent AN cases and 8629 European ancestry controls along with 458 AN cases and 421 controls from Japan. The final global meta-analysis across discovery and replication data sets comprised 5551 AN cases and 21 080 controls. AN subtype analyses (1606 AN restricting; 1445 AN binge-purge) were performed. No findings reached genome-wide significance. Two intronic variants were suggestively associated: rs9839776 (P=3.01 × 10(-7)) in SOX2OT and rs17030795 (P=5.84 × 10(-6)) in PPP3CA. Two additional signals were specific to Europeans: rs1523921 (P=5.76 × 10(-)(6)) between CUL3 and FAM124B and rs1886797 (P=8.05 × 10(-)(6)) near SPATA13. Comparing discovery with replication results, 76% of the effects were in the same direction, an observation highly unlikely to be due to chance (P=4 × 10(-6)), strongly suggesting that true findings exist but our sample, the largest yet reported, was underpowered for their detection. The accrual of large genotyped AN case-control samples should be an immediate priority for the field.