881 results for Large-scale Analysis
Abstract:
Global communication requirements and load imbalance of some parallel data mining algorithms are the major obstacles to exploiting the computational power of large-scale systems. This work investigates how non-uniform data distributions can be exploited to remove the global communication requirement and to reduce the communication cost in iterative parallel data mining algorithms. In particular, the analysis focuses on one of the most influential and popular data mining methods, the k-means algorithm for cluster analysis. The straightforward parallel formulation of the k-means algorithm requires a global reduction operation at each iteration step, which hinders its scalability. This work studies a different parallel formulation of the algorithm in which the requirement of global communication can be relaxed while still providing the exact solution of the centralised k-means algorithm. The proposed approach exploits a non-uniform data distribution which can either be found in real-world distributed applications or be induced by means of multi-dimensional binary search trees. The approach can also be extended to accommodate an approximation error, which allows a further reduction of the communication costs.
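As an illustration of the scalability bottleneck described above, the following minimal sketch (our own, hypothetical formulation using mpi4py; variable names are illustrative) shows the straightforward parallel k-means in which every iteration ends with a global reduction of per-centroid sums and counts.

import numpy as np
from mpi4py import MPI

def parallel_kmeans(local_X, k, n_iter=20, seed=0):
    comm = MPI.COMM_WORLD
    dim = local_X.shape[1]
    # identical seed on every rank, so all ranks start from the same centroids
    centroids = np.random.default_rng(seed).standard_normal((k, dim))
    for _ in range(n_iter):
        # assign each local point to its nearest centroid
        dist = np.linalg.norm(local_X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # local partial sums and counts per centroid
        sums = np.zeros((k, dim))
        counts = np.zeros(k)
        for j in range(k):
            members = local_X[labels == j]
            sums[j] = members.sum(axis=0)
            counts[j] = members.shape[0]
        # the global reduction that limits scalability: every rank
        # must synchronize here at every single iteration
        total_sums = comm.allreduce(sums, op=MPI.SUM)
        total_counts = comm.allreduce(counts, op=MPI.SUM)
        occupied = total_counts > 0
        centroids[occupied] = total_sums[occupied] / total_counts[occupied, None]
    return centroids

The formulation studied in the work above relaxes precisely this allreduce step by exploiting a non-uniform data distribution, for example one induced by multi-dimensional binary search trees.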
Abstract:
Some recent winters in Western Europe have been characterized by the occurrence of multiple extratropical cyclones following a similar path. The occurrence of such cyclone clusters leads to large socio-economic impacts due to damaging winds, storm surges, and floods. Recent studies have statistically characterized the clustering of extratropical cyclones over the North Atlantic and Europe and hypothesized potential physical mechanisms responsible for their formation. Here we analyze four months characterized by multiple cyclones over Western Europe (February 1990, January 1993, December 1999, and January 2007). The evolution of the eddy-driven jet stream, Rossby wave-breaking, and upstream/downstream cyclone development are investigated to infer the role of the large-scale flow and to determine if clustered cyclones are related to each other. Results suggest that optimal conditions for the occurrence of cyclone clusters are provided by a recurrent extension of an intensified eddy-driven jet toward Western Europe lasting at least one week. Multiple Rossby wave-breaking occurrences on both the poleward and equatorward flanks of the jet contribute to the development of these anomalous large-scale conditions. The analysis of the daily weather charts reveals that upstream cyclone development (secondary cyclogenesis, where new cyclones are generated on the trailing fronts of mature cyclones) is strongly related to cyclone clustering, with multiple cyclones developing on a single jet streak. The present analysis permits a deeper understanding of the physical reasons leading to the occurrence of cyclone families over the North Atlantic, enabling a better estimation of the associated cumulative risk over Europe.
Abstract:
A truly variance-minimizing filter is introduced and its performance is demonstrated with the Korteweg–de Vries (KdV) equation and with a multilayer quasigeostrophic model of the ocean area around South Africa. It is recalled that Kalman-like filters are not variance minimizing for nonlinear model dynamics and that four-dimensional variational data assimilation (4DVAR)-like methods relying on perfect model dynamics have difficulty with providing error estimates. The new method does not have these drawbacks. In fact, it combines advantages from both methods in that it does provide error estimates while automatically having balanced states after analysis, without extra computations. It is based on ensemble or Monte Carlo integrations to simulate the probability density of the model evolution. When observations are available, the so-called importance resampling algorithm is applied. From Bayes's theorem it follows that each ensemble member receives a new weight dependent on its "distance" to the observations. Because the weights are strongly varying, a resampling of the ensemble is necessary. This resampling is done such that members with high weights are duplicated according to their weights, while low-weight members are largely ignored. In passing, it is noted that data assimilation is not an inverse problem by nature, although it can be formulated that way. Also, it is shown that the posterior variance can be larger than the prior if the usual Gaussian framework is set aside. However, in the examples presented here, the entropy of the probability densities is decreasing. The application to the ocean area around South Africa, governed by strongly nonlinear dynamics, shows that the method is working satisfactorily. The strong and weak points of the method are discussed and possible improvements are proposed.
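The importance resampling step described above can be sketched in a few lines. The scalar observation and the Gaussian observation-error likelihood below are illustrative assumptions, not the paper's configuration.

import numpy as np

def importance_resample(ensemble, obs, obs_var, rng=None):
    """ensemble: (n_members, state_dim); obs observes state component 0."""
    if rng is None:
        rng = np.random.default_rng(0)
    # weight each member by its likelihood given the observation (Bayes)
    misfit = obs - ensemble[:, 0]
    w = np.exp(-0.5 * misfit**2 / obs_var)
    w /= w.sum()
    # resample: high-weight members are duplicated in proportion to their
    # weight, low-weight members are likely dropped
    idx = rng.choice(len(ensemble), size=len(ensemble), p=w)
    return ensemble[idx]

# usage: propagate the ensemble with the (nonlinear) model between
# observation times, then resample at each analysis time
ens = np.random.default_rng(1).normal(size=(100, 3))
ens = importance_resample(ens, obs=0.5, obs_var=0.1)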
Abstract:
The relationship between the structure and function of biological networks constitutes a fundamental issue in systems biology. In particular, the structure of protein-protein interaction networks is related to important biological functions. In this work, we investigated how the resilience of these networks is determined by their large-scale features. Four species are taken into account, namely the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, the fly Drosophila melanogaster, and Homo sapiens. We adopted two entropy-related measurements (degree entropy and dynamic entropy) in order to quantify the overall robustness of these networks. We verified that while they exhibit similar structural variations under random node removal, they differ significantly when subjected to intentional attacks (hub removal). As a matter of fact, more complex species tended to exhibit more robust networks. More specifically, we quantified how six important measurements of network topology (namely the clustering coefficient, average degree of neighbors, average shortest path length, diameter, assortativity coefficient, and slope of the power-law degree distribution) correlate with the two entropy measurements. Our results revealed that the fraction of hubs and the average neighbor degree contribute significantly to the resilience of networks. In addition, the topological analysis of the removed hubs indicated that the presence of alternative paths between the proteins connected to hubs tends to reinforce resilience. The analysis performed helps in understanding how resilience arises in networks and can be applied to the development of protein network models.
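As a rough sketch of this kind of analysis (under our own simplified reading: degree entropy taken as the Shannon entropy of the degree distribution), the snippet below contrasts random node removal with hub removal on a scale-free stand-in for a protein interaction network.

import numpy as np
import networkx as nx

def degree_entropy(G):
    # Shannon entropy of the degree distribution (our simplifying assumption)
    degrees = np.array([d for _, d in G.degree()])
    _, counts = np.unique(degrees, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def entropy_after_removal(G, fraction=0.2, by_hubs=False, seed=0):
    G = G.copy()
    n_remove = int(fraction * G.number_of_nodes())
    if by_hubs:  # intentional attack: highest-degree nodes first
        nodes = [n for n, _ in sorted(G.degree, key=lambda x: -x[1])[:n_remove]]
    else:        # random failure
        rng = np.random.default_rng(seed)
        nodes = rng.choice(list(G.nodes), size=n_remove, replace=False)
    G.remove_nodes_from(nodes)
    return degree_entropy(G)

G = nx.barabasi_albert_graph(1000, 3)  # scale-free stand-in for a PPI network
print(degree_entropy(G),
      entropy_after_removal(G),               # random removal
      entropy_after_removal(G, by_hubs=True)) # hub removal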
Abstract:
Analyses of circulating metabolites in large prospective epidemiological studies could lead to improved prediction and better biological understanding of coronary heart disease (CHD). We performed a mass spectrometry-based non-targeted metabolomics study for association with incident CHD events in 1,028 individuals (131 events; 10-year median follow-up) with validation in 1,670 individuals (282 events; 3.9-year median follow-up). Four metabolites were replicated and independent of main cardiovascular risk factors [lysophosphatidylcholine 18∶1 (hazard ratio [HR] per standard deviation [SD] increment = 0.77, P-value<0.001), lysophosphatidylcholine 18∶2 (HR = 0.81, P-value<0.001), monoglyceride 18∶2 (MG 18∶2; HR = 1.18, P-value = 0.011) and sphingomyelin 28∶1 (HR = 0.85, P-value = 0.015)]. Together they contributed to moderate improvements in discrimination and re-classification in addition to traditional risk factors (C-statistic: 0.76 vs. 0.75; NRI: 9.2%). MG 18∶2 was associated with CHD independently of triglycerides. Lysophosphatidylcholines were negatively associated with body mass index, C-reactive protein, and with less evidence of subclinical cardiovascular disease in an additional 970 participants; a reverse pattern was observed for MG 18∶2. MG 18∶2 showed an enrichment (P-value = 0.002) of significant associations with CHD-associated SNPs (P-value = 1.2×10⁻⁷ for association with rs964184 in the ZNF259/APOA5 region) and a weak but positive causal effect (odds ratio = 1.05 per SD increment in MG 18∶2, P-value = 0.05) on CHD, as suggested by Mendelian randomization analysis. In conclusion, we identified four lipid-related metabolites with evidence for clinical utility, as well as a causal role in CHD development.
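For readers unfamiliar with hazard ratios per SD increment, the following hedged sketch shows how such an estimate is typically obtained with a Cox proportional hazards model; the synthetic data and column names are stand-ins, not the study's cohort or its full covariate adjustment.

import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "followup_years": rng.exponential(8, n),  # time to event or censoring
    "chd_event": rng.integers(0, 2, n),       # 1 = incident CHD
    "lpc_18_1": rng.normal(5, 1.5, n),        # metabolite level (hypothetical)
    "age": rng.normal(60, 8, n),              # a traditional risk factor
})
# standardize the metabolite so its coefficient is per SD increment
df["lpc_18_1"] = (df["lpc_18_1"] - df["lpc_18_1"].mean()) / df["lpc_18_1"].std()

cph = CoxPHFitter()
cph.fit(df, duration_col="followup_years", event_col="chd_event")
print(cph.hazard_ratios_["lpc_18_1"])  # HR per SD, adjusted for age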
Abstract:
A detailed genome mapping analysis of 213,636 expressed sequence tags (ESTs) derived from nontumor and tumor tissues of the oral cavity, larynx, pharynx, and thyroid was done. Transcripts matching known human genes were identified; potential new splice variants were flagged and subjected to manual curation, pointing to 788 putatively new alternative splicing isoforms, the majority (75%) being insertion events. A subset of 34 new splicing isoforms (5% of the 788 events) was selected and 23 (68%) were confirmed by reverse transcription-PCR and DNA sequencing. Putative new genes were revealed, including six transcripts mapped to well-studied chromosomes such as 22, as well as transcripts that mapped to 253 intergenic regions. In addition, 2,251 noncoding intronic RNAs, possibly involved in transcriptional regulation, were found. A set of 250 candidate markers for loss of heterozygosity or gene amplification was selected by identifying transcripts that mapped to genomic regions previously known to be frequently amplified or deleted in head, neck, and thyroid tumors. Three of these markers were evaluated by quantitative reverse transcription-PCR in an independent set of individual samples. Along with detailed clinical data about tumor origin, the information reported here is now publicly available on a dedicated Web site as a resource for further biological investigation. This first in silico reconstruction of the head, neck, and thyroid transcriptomes points to a wealth of new candidate markers that can be used for future studies on the molecular basis of these tumors. Similar analysis is warranted for a number of other tumors for which large EST data sets are available.
Abstract:
This paper presents the development of a mathematical model to optimize the management and operation of the Brazilian hydrothermal system. The system consists of a large set of individual hydropower plants and a set of aggregated thermal plants. The energy generated in the system is interconnected by a transmission network so it can be transmitted to centers of consumption throughout the country. The proposed optimization model is capable of handling different types of constraints, such as interbasin water transfers, water supply for various purposes, and environmental requirements. Its overall objective is to produce energy to meet the country's demand at minimum cost. Called HIDROTERM, the model integrates a database with basic hydrological and technical information to run the optimization model, and provides an interface to manage the input and output data. The optimization model uses the General Algebraic Modeling System (GAMS) package and can invoke different linear as well as nonlinear programming solvers. The optimization model was applied to the Brazilian hydrothermal system, one of the largest in the world. The system is divided into four subsystems with 127 active hydropower plants. Preliminary results under different scenarios of inflow, demand, and installed capacity demonstrate the efficiency and utility of the model. From this and other case studies in Brazil, the results indicate that the methodology developed is suitable for different applications, such as operation planning, capacity expansion, operational rule studies, and trade-off analysis among multiple water users. DOI: 10.1061/(ASCE)WR.1943-5452.0000149. (C) 2012 American Society of Civil Engineers.
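The minimum-cost objective can be illustrated with a toy linear program (a sketch of the general idea only, not the HIDROTERM/GAMS formulation): hydro energy carries no fuel cost but is limited by the available water, and thermal generation fills the remainder at a cost.

import numpy as np
from scipy.optimize import linprog

demand = np.array([90.0, 110.0, 100.0])  # energy demand per period (hypothetical)
thermal_cost = 50.0                       # cost per unit of thermal energy
hydro_budget = 180.0                      # total hydro energy available

# variables: [h1, h2, h3, t1, t2, t3]; only thermal energy has a cost
c = np.array([0, 0, 0, thermal_cost, thermal_cost, thermal_cost])
# equality: hydro + thermal must meet demand in each period
A_eq = np.hstack([np.eye(3), np.eye(3)])
# inequality: total hydro use within the water/energy budget
A_ub = np.array([[1, 1, 1, 0, 0, 0]])
res = linprog(c, A_ub=A_ub, b_ub=[hydro_budget], A_eq=A_eq, b_eq=demand,
              bounds=[(0, None)] * 6)
print(res.x)  # hydro is used first; thermal covers the shortfall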
Abstract:
The complexity of power systems has increased in recent years due to the operation of existing transmission lines closer to their limits, the use of flexible AC transmission system (FACTS) devices, and the increased penetration of new types of generators with more intermittent characteristics and lower inertial response, such as wind generators. This changing nature of power systems has a considerable effect on their dynamic behavior, resulting in power swings, dynamic interactions between different power system devices, and less synchronized coupling. This paper presents analyses of this changing nature of power systems and their dynamic behavior to identify critical issues that limit the large-scale integration of wind generators and FACTS devices. In addition, this paper addresses some general concerns about high compensation levels in different grid topologies. The studies in this paper are conducted on the New England and New York power system model under both small and large disturbances. From the analyses, it can be concluded that high compensation can reduce the security limits under certain operating conditions, and that the modes related to operating slip and shaft stiffness are critical, as they may limit the large-scale integration of wind generation.
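The notion of critical modes limiting integration can be made concrete with a minimal small-signal sketch: linearize the system, then inspect the eigenvalues and damping ratios. The 2x2 state matrix below is a hypothetical stand-in, not the New England/New York model.

import numpy as np

A = np.array([[0.0, 377.0],      # d(delta)/dt driven by speed deviation
              [-0.12, -0.05]])   # hypothetical synchronizing/damping terms
for lam in np.linalg.eigvals(A):
    freq_hz = abs(lam.imag) / (2 * np.pi)
    damping = -lam.real / abs(lam) if abs(lam) > 0 else 1.0
    # poorly damped modes (small or negative ratio) are the critical ones
    print(f"mode {lam:.3f}: {freq_hz:.2f} Hz, damping ratio {damping:.4f}")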
Abstract:
The continuous increase in genome sequencing projects has produced a huge amount of data over the last 10 years: currently more than 600 prokaryotic and 80 eukaryotic genomes are fully sequenced and publicly available. However, sequencing alone determines only raw nucleotide sequences. This is just the first step of the genome annotation process, which deals with assigning biological information to each sequence. Annotation is carried out at every level of the biological information processing mechanism, from DNA to protein, and cannot be accomplished by in vitro analysis procedures alone, which become extremely expensive and time consuming when applied at such a large scale. Thus, in silico methods must be used to accomplish the task. The aim of this work was the implementation of predictive computational methods that allow fast, reliable, and automated annotation of genomes and proteins starting from amino acid sequences. The first part of the work focused on the implementation of a new machine-learning-based method for predicting the subcellular localization of soluble eukaryotic proteins. The method, called BaCelLo, was developed in 2006. Its main peculiarity is its independence from biases present in the training dataset, which cause the over-prediction of the most represented examples in all the other predictors developed so far. This important result was achieved through a modification I made to the standard Support Vector Machine (SVM) algorithm, creating the so-called Balanced SVM. BaCelLo predicts the most important subcellular localizations in eukaryotic cells, and three kingdom-specific predictors were implemented. In two extensive comparisons, carried out in 2006 and 2008, BaCelLo was shown to outperform all the currently available state-of-the-art methods for this prediction task. BaCelLo was subsequently used to completely annotate 5 eukaryotic genomes, by integrating it into a pipeline of predictors developed at the Bologna Biocomputing group by Dr. Pier Luigi Martelli and Dr. Piero Fariselli. An online database, called eSLDB, was developed by integrating, for each amino acid sequence extracted from the genomes, the predicted subcellular localization merged with experimental and similarity-based annotations. In the second part of the work a new machine-learning-based method was implemented for the prediction of GPI-anchored proteins. The method efficiently predicts from the raw amino acid sequence both the presence of the GPI anchor (by means of an SVM) and the position in the sequence of the post-translational modification event, the so-called ω-site (by means of a Hidden Markov Model, HMM). The method, called GPIPE, was shown to greatly improve prediction performance for GPI-anchored proteins over all previously developed methods. GPIPE predicted up to 88% of the experimentally annotated GPI-anchored proteins while maintaining a false positive rate as low as 0.1%. GPIPE was used to completely annotate 81 eukaryotic genomes, and more than 15,000 putative GPI-anchored proteins were predicted, 561 of which are found in H. sapiens. On average, 1% of a proteome is predicted as GPI-anchored. A statistical analysis of the composition of the regions surrounding the ω-site allowed the definition of specific amino acid abundances in the different regions considered.
Furthermore, the hypothesis proposed in the literature that compositional biases are present among the four major eukaryotic kingdoms was tested and rejected. All the developed predictors and databases are freely available at: BaCelLo, http://gpcr.biocomp.unibo.it/bacello; eSLDB, http://gpcr.biocomp.unibo.it/esldb; GPIPE, http://gpcr.biocomp.unibo.it/gpipe
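The Balanced SVM itself is the thesis' own modification of the SVM training algorithm; as an illustrative stand-in for the underlying idea of countering training-set bias, the sketch below uses scikit-learn's per-class error weighting on imbalanced toy data.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# imbalanced toy data: 900 "cytoplasm" examples vs 100 "nucleus" examples
X = np.vstack([rng.normal(0, 1, (900, 5)), rng.normal(1, 1, (100, 5))])
y = np.array([0] * 900 + [1] * 100)

plain = SVC().fit(X, y)                            # biased toward class 0
balanced = SVC(class_weight="balanced").fit(X, y)  # errors reweighted per class
print((plain.predict(X) == 1).sum(), (balanced.predict(X) == 1).sum())

Here class_weight="balanced" scales the misclassification penalty inversely to class frequency, so the minority class contributes equally to the training loss; this is the same spirit as, though not the same algorithm as, the Balanced SVM described above.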
Abstract:
In this thesis we present the implementation of the quadratic maximum likelihood (QML) method, ideally suited to estimating the angular power spectrum of the cross-correlation between cosmic microwave background (CMB) and large scale structure (LSS) maps, as well as their individual auto-spectra. Such a tool is an optimal method (unbiased and with minimum variance) in pixel space and goes beyond all the previous harmonic-space analyses present in the literature. We describe the implementation of the QML method in the {\it BolISW} code and demonstrate its accuracy on simulated maps through a Monte Carlo study. We apply this optimal estimator to WMAP 7-year and NRAO VLA Sky Survey (NVSS) data and explore the robustness of the angular power spectrum estimates obtained by the QML method. Taking into account the shot noise and one of the systematics (declination correction) in NVSS, we can safely use most of the information contained in this survey. By contrast, we neglect the noise in temperature, since WMAP is already cosmic-variance dominated on large scales. Because of a discrepancy between the estimated galaxy auto-spectrum and the theoretical model, we use two different galaxy distributions: the first with a constant bias $b$ and the second with a redshift-dependent bias $b(z)$. Finally, we use the angular power spectrum estimates obtained by the QML method to derive constraints on the dark energy critical density in a flat $\Lambda$CDM model under different likelihood prescriptions. Using just the cross-correlation between WMAP7 and NVSS maps at 1.8° resolution, we show that $\Omega_\Lambda$ accounts for about 70\% of the total energy density, disfavouring an Einstein-de Sitter Universe at more than 2$\sigma$ CL (confidence level).
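For reference, the standard pixel-space QML construction (stated here in its Tegmark-style form from the general literature, not quoted from the BolISW code) estimates band powers from a map vector $x$ with covariance $C(C_\ell)$ and noise covariance $N$:

$$ y_\ell = x^{T} E^{\ell} x - \mathrm{tr}\left(N E^{\ell}\right), \qquad E^{\ell} = \frac{1}{2}\, C^{-1} \frac{\partial C}{\partial C_\ell}\, C^{-1}, $$
$$ F_{\ell\ell'} = \frac{1}{2}\, \mathrm{tr}\left[ C^{-1} \frac{\partial C}{\partial C_\ell}\, C^{-1} \frac{\partial C}{\partial C_{\ell'}} \right], \qquad \hat{C}_\ell = \sum_{\ell'} \left(F^{-1}\right)_{\ell\ell'} y_{\ell'}, $$

where the Fisher matrix $F$ both removes the bias of the quadratic estimates $y_\ell$ and provides their minimum-variance error bars.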
Abstract:
The mass estimation of galaxy clusters is a crucial issue for modern cosmology and can be obtained by several different techniques. In this work we discuss a new method to measure the mass of galaxy clusters by connecting the gravitational potential of the cluster with the kinematical properties of its surroundings. We explore the dynamics of the structures located in the region outside the virialized cluster: we identify groups of galaxies, such as sheets or filaments, in the cluster outer region, and model how the cluster gravitational potential perturbs the motion of these structures away from the Hubble flow. This identification is done in redshift space, where we look for overdensities with a filamentary shape. We then use a mean radial velocity profile, which has been found to be a rather universal trend in simulations, to fit the radial infall velocity profile of the overdensities found. The method has been tested on several cluster-size haloes from cosmological N-body simulations, giving results in very good agreement with the true virial masses of the haloes and the orientations of the sheets. We then applied the method to the Coma cluster, and in this case too we found good agreement with previous estimates. It is possible to notice a mass discrepancy between sheets with different alignments with respect to the center of the cluster. This difference can be used to reproduce the shape of the cluster and to demonstrate that spherical symmetry is not always a valid assumption. In fact, if the cluster is not spherical, sheets oriented along different axes should feel slightly different gravitational potentials, and so yield different masses as a result of the analysis described above. This estimate was also tested on cosmological simulations and then applied to Coma, showing the actual non-sphericity of this cluster.
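The final fitting step can be sketched schematically: given velocities of a sheet at known radii, fit an assumed mean infall profile whose amplitude scales with the cluster mass. The power-law profile and all parameter values below are illustrative assumptions of ours, not the calibrated simulation profile used in the work.

import numpy as np
from scipy.optimize import curve_fit

H0 = 70.0  # km/s/Mpc

def mean_radial_velocity(r, m15, cos_align):
    # Hubble flow minus an infall term growing with cluster mass
    # (m15 = mass in units of 1e15 Msun); cos_align projects the radial
    # motion onto the observed line of sight
    v_infall = 1000.0 * m15 ** 0.4 * (r / 5.0) ** -1.0  # km/s, assumed form
    return cos_align * (H0 * r - v_infall)

r_obs = np.linspace(2.0, 10.0, 30)                      # Mpc, mock sheet
v_obs = mean_radial_velocity(r_obs, 0.9, 0.8)
v_obs += np.random.default_rng(0).normal(0.0, 30.0, r_obs.size)

popt, _ = curve_fit(mean_radial_velocity, r_obs, v_obs, p0=(0.5, 0.5))
print(f"fitted mass: {popt[0]:.2f} x 1e15 Msun, alignment: {popt[1]:.2f}")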
Abstract:
The last two decades have seen intense scientific and regulatory interest in the health effects of particulate matter (PM). Influential epidemiological studies that characterize chronic exposure of individuals rely on monitoring data that are sparse in space and time, so they often assign the same exposure to participants in large geographic areas and across time. We estimate monthly PM during 1988-2002 in a large spatial domain for use in studying health effects in the Nurses' Health Study. We develop a conceptually simple spatio-temporal model that uses a rich set of covariates. The model is used to estimate concentrations of PM10 for the full time period and PM2.5 for a subset of the period. For the earlier part of the period, 1988-1998, few PM2.5 monitors were operating, so we develop a simple extension to the model that represents PM2.5 conditionally on PM10 model predictions. In the epidemiological analysis, model predictions of PM10 are more strongly associated with health effects than those from simpler approaches to estimating exposure. Our modeling approach supports the application by estimating both fine-scale and large-scale spatial heterogeneity and by capturing space-time interaction through the use of monthly-varying spatial surfaces. At the same time, the model is computationally feasible, implementable with standard software, and readily understandable to the scientific audience. Despite simplifying assumptions, the model has good predictive performance and uncertainty characterization.
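The conditional PM2.5 extension can be sketched in a few lines (the linear form and variable names are our own illustration, not the paper's model): regress observed PM2.5 on the PM10 model's predictions plus covariates where both are available, then predict PM2.5 for the earlier period.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 500
pm10_pred = rng.gamma(4, 6, n)       # PM10 model predictions, later period
covariate = rng.normal(0, 1, n)      # e.g. a land-use covariate
pm25_obs = 0.55 * pm10_pred + 1.2 * covariate + rng.normal(0, 2, n)

X = np.column_stack([pm10_pred, covariate])
model = LinearRegression().fit(X, pm25_obs)  # fit where PM2.5 was monitored

# apply to the early period (1988-1998), where only PM10 predictions exist
early_X = np.column_stack([rng.gamma(4, 7, 100), rng.normal(0, 1, 100)])
pm25_early = model.predict(early_X)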
Abstract:
Secondary forests in the Lower Mekong Basin (LMB) are increasingly recognized as a valuable component of land cover, providing ecosystem services and benefits for local users. A large proportion of secondary forests in the LMB, especially in the uplands, are maintained by swidden cultivation. In order to assess the regional-scale status and dynamic trends of secondary forests in the LMB, an analysis of existing regional land cover data for 1993 and 1997 was carried out and forms the basis of this paper. To gain insight into the full range of dynamics affecting secondary forests beyond net-change rates, cross-tabulation matrix analyses were performed. The investigations revealed that secondary forests make up the largest share of forest cover in the LMB, with over 80% located in Laos and Cambodia. The deforestation rates for secondary forests are 3 times higher than the rates for other forest categories and account for two-thirds of the total deforestation. These dynamics are particularly pronounced in the less advanced countries of the LMB, especially in Laos, where national policies and the opening up of national economies seem to be the main drivers of further degradation and loss of secondary forests.
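A cross-tabulation matrix analysis of the kind mentioned above can be reproduced in miniature with pandas; the class labels and change rates below are illustrative, not the LMB data.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
classes = ["dense forest", "secondary forest", "cropland", "other"]
lc_1993 = rng.choice(classes, size=10000, p=[0.3, 0.4, 0.2, 0.1])
# simulate some land cover change between the two dates
lc_1997 = np.where(rng.random(10000) < 0.15,
                   rng.choice(classes, size=10000), lc_1993)

change_matrix = pd.crosstab(pd.Series(lc_1993, name="1993"),
                            pd.Series(lc_1997, name="1997"))
print(change_matrix)  # off-diagonal cells are the gross class transitions
                      # that net-change rates hide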
Abstract:
Understanding natural climate variability and its driving factors is crucial to assessing future climate change. Therefore, comparing proxy-based climate reconstructions with forcing factors as well as comparing these with paleoclimate model simulations is key to gaining insights into the relative roles of internal versus forced variability. A review of the state of modelling of the climate of the last millennium prior to the CMIP5–PMIP3 (Coupled Model Intercomparison Project Phase 5–Paleoclimate Modelling Intercomparison Project Phase 3) coordinated effort is presented and compared to the available temperature reconstructions. Simulations and reconstructions broadly agree on reproducing the major temperature changes and suggest an overall linear response to external forcing on multidecadal or longer timescales. Internal variability is found to have an important influence at hemispheric and global scales. The spatial distribution of simulated temperature changes during the transition from the Medieval Climate Anomaly to the Little Ice Age disagrees with that found in the reconstructions. Thus, either internal variability is a possible major player in shaping temperature changes through the millennium or the model simulations have problems realistically representing the response pattern to external forcing. A last millennium transient climate response (LMTCR) is defined to provide a quantitative framework for analysing the consistency between simulated and reconstructed climate. Beyond an overall agreement between simulated and reconstructed LMTCR ranges, this analysis is able to single out specific discrepancies between some reconstructions and the ensemble of simulations. The disagreement is found in the cases where the reconstructions show reduced covariability with external forcings or when they present high rates of temperature change.
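One simple way to make the transient-response idea concrete (a sketch consistent with, but not identical to, the paper's LMTCR definition) is the least-squares slope of multidecadally smoothed temperature against total external forcing; the synthetic series below only stand in for reconstructions and forcing estimates.

import numpy as np

rng = np.random.default_rng(0)
years = np.arange(850, 1850)
forcing = 0.3 * np.sin(2 * np.pi * years / 210) + rng.normal(0, 0.1, years.size)
temp = 0.4 * forcing + rng.normal(0, 0.15, years.size)  # K and W/m^2

# 31-year running means, then the least-squares slope in K per W/m^2
k = 31
kernel = np.ones(k) / k
forcing_smooth = np.convolve(forcing, kernel, mode="valid")
temp_smooth = np.convolve(temp, kernel, mode="valid")
slope = np.polyfit(forcing_smooth, temp_smooth, 1)[0]
print(f"transient response: {slope:.2f} K per W m^-2")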