13 resultados para Evolutionary clustering

em AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevância:

30.00% 30.00%

Publicador:

Resumo:

Although ability to digest lactose generally declines after weaning in all mammals, in some human populations it persists also in adult individuals, a condition named lactase persistence (LP). Studies on the prevalence of the LP phenotype in worldwide human populations have shown that the frequency of this trait is highly variable in different ethnic groups, appearing to be positively correlated with the importance of milk in the diet. In particular, several single-nucleotide polymorphisms (SNPs) in the proximity of the LCT gene have been proved to be associated with LP. Nevertheless, few studies have till now analyzed genetic variation underlying LP in a wide set of Eurasian populations and, especially, in the Italian one. In the present study, we thus typed 40 SNPs surrounding the LCT gene in more than 1,000 samples from Italian and Arabic peninsulas to investigate patterns of LP-related genetic diversity in two regions which have played a pivotal role in the recent human evolutionary history according to their geographical position and historical/archaeological records. Our results underline a high and complex variability of the explored genomic region in both studied populations. In particular, a clear diversification of Northern Italian groups from the rest of the peninsula, was observed, with the formers being genetically more similar to Northern European populations than to Southern Italians. These observation are consistent with known decreasing pattern of LP from Northern to Southern Italy and suggest the possibility of an independent evolution of LP-associated genotypes in Northern Italy. A similar scenario was observed in the Arabian peninsula, with Dhofari Arabs from Southern Oman and Yemeni clustering together with respect to Arabs from Northern Oman and the subgroup of Omanis of Asian origin which appeared instead to be genetically closer to Europeans than to the rest of Arabic groups.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The present work proposes a method based on CLV (Clustering around Latent Variables) for identifying groups of consumers in L-shape data. This kind of datastructure is very common in consumer studies where a panel of consumers is asked to assess the global liking of a certain number of products and then, preference scores are arranged in a two-way table Y. External information on both products (physicalchemical description or sensory attributes) and consumers (socio-demographic background, purchase behaviours or consumption habits) may be available in a row descriptor matrix X and in a column descriptor matrix Z respectively. The aim of this method is to automatically provide a consumer segmentation where all the three matrices play an active role in the classification, getting homogeneous groups from all points of view: preference, products and consumer characteristics. The proposed clustering method is illustrated on data from preference studies on food products: juices based on berry fruits and traditional cheeses from Trentino. The hedonic ratings given by the consumer panel on the products under study were explained with respect to the product chemical compounds, sensory evaluation and consumer socio-demographic information, purchase behaviour and consumption habits.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The Peer-to-Peer network paradigm is drawing the attention of both final users and researchers for its features. P2P networks shift from the classic client-server approach to a high level of decentralization where there is no central control and all the nodes should be able not only to require services, but to provide them to other peers as well. While on one hand such high level of decentralization might lead to interesting properties like scalability and fault tolerance, on the other hand it implies many new problems to deal with. A key feature of many P2P systems is openness, meaning that everybody is potentially able to join a network with no need for subscription or payment systems. The combination of openness and lack of central control makes it feasible for a user to free-ride, that is to increase its own benefit by using services without allocating resources to satisfy other peers’ requests. One of the main goals when designing a P2P system is therefore to achieve cooperation between users. Given the nature of P2P systems based on simple local interactions of many peers having partial knowledge of the whole system, an interesting way to achieve desired properties on a system scale might consist in obtaining them as emergent properties of the many interactions occurring at local node level. Two methods are typically used to face the problem of cooperation in P2P networks: 1) engineering emergent properties when designing the protocol; 2) study the system as a game and apply Game Theory techniques, especially to find Nash Equilibria in the game and to reach them making the system stable against possible deviant behaviors. In this work we present an evolutionary framework to enforce cooperative behaviour in P2P networks that is alternative to both the methods mentioned above. Our approach is based on an evolutionary algorithm inspired by computational sociology and evolutionary game theory, consisting in having each peer periodically trying to copy another peer which is performing better. The proposed algorithms, called SLAC and SLACER, draw inspiration from tag systems originated in computational sociology, the main idea behind the algorithm consists in having low performance nodes copying high performance ones. The algorithm is run locally by every node and leads to an evolution of the network both from the topology and from the nodes’ strategy point of view. Initial tests with a simple Prisoners’ Dilemma application show how SLAC is able to bring the network to a state of high cooperation independently from the initial network conditions. Interesting results are obtained when studying the effect of cheating nodes on SLAC algorithm. In fact in some cases selfish nodes rationally exploiting the system for their own benefit can actually improve system performance from the cooperation formation point of view. The final step is to apply our results to more realistic scenarios. We put our efforts in studying and improving the BitTorrent protocol. BitTorrent was chosen not only for its popularity but because it has many points in common with SLAC and SLACER algorithms, ranging from the game theoretical inspiration (tit-for-tat-like mechanism) to the swarms topology. We discovered fairness, meant as ratio between uploaded and downloaded data, to be a weakness of the original BitTorrent protocol and we drew inspiration from the knowledge of cooperation formation and maintenance mechanism derived from the development and analysis of SLAC and SLACER, to improve fairness and tackle freeriding and cheating in BitTorrent. We produced an extension of BitTorrent called BitFair that has been evaluated through simulation and has shown the abilities of enforcing fairness and tackling free-riding and cheating nodes.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The vast majority of known proteins have not yet been experimentally characterized and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it became urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work has been the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein folding recognition and de novo design. The prediction of these contacts requires the study of the protein inter-residue distances related to the specific type of amino acid pair that are encoded in the so-called contact map. An interesting new way of analyzing those structures came out when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied to which extent the characteristic path length and clustering coefficient of the protein contacts network are values that reveal characteristic features of protein contact maps. Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers that drive the dimerization of many transcription factors or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments, to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for the coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of protein sequences and structures available poses new fundamental problems that still deserve an interpretation. Nevertheless, these data are at the basis of the design of new strategies for tackling problems such as the prediction of protein structure and function. Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. As an example, currently, approximately only 20% of annotated proteins in the Homo sapiens genome have been experimentally characterized. A commonly adopted procedure for annotating protein sequences relies on the "inheritance through homology" based on the notion that similar sequences share similar functions and structures. This procedure consists in the assignment of sequences to a specific group of functionally related sequences which had been grouped through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but that do not necessarily share the same function, to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validate a system that contributes to sequence annotation by taking advantage of a validated transfer through inheritance procedure of the molecular functions and of the structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and coverage of the alignment. The adopted measure explicity answers to the problem of multi-domain proteins annotation and allows a fine grain division of the whole set of proteomes used, that ensures cluster homogeneity in terms of sequence length. A high level of coverage of structure templates on the length of protein sequences within clusters ensures that multi-domain proteins when present can be templates for sequences of similar length. This annotation procedure includes the possibility of reliably transferring statistically validated functions and structures to sequences considering information available in the present data bases of molecular functions and structures.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In the recent years TNFRSF13B coding variants have been implicated by clinical genetics studies in Common Variable Immunodeficiency (CVID), the most common clinically relevant primary immunodeficiency in individuals of European ancestry, but their functional effects in relation to the development of the disease have not been entirely established. To examine the potential contribution of such variants to CVID, the more comprehensive perspective of an evolutionary approach was applied in this study, underling the belief that evolutionary genetics methods can play a role in dissecting the origin, causes and diffusion of human diseases, representing a powerful tool also in human health research. For this purpose, TNFRSF13B coding region was sequenced in 451 healthy individuals belonging to 26 worldwide populations, in addition to 96 control, 77 CVID and 38 Selective IgA Deficiency (IgAD) individuals from Italy, leading to the first achievement of a global picture of TNFRSF13B nucleotide diversity and haplotype structure and making suggestion of its evolutionary history possible. A slow rate of evolution, within our species and when compared to the chimpanzee, low levels of genetic diversity geographical structure and the absence of recent population specific selective pressures were observed for the examined genomic region, suggesting that geographical distribution of its variability is more plausibly related to its involvement also in innate immunity rather than in adaptive immunity only. This, together with the extremely subtle disease/healthy samples differences observed, suggests that CVID might be more likely related to still unknown environmental and genetic factors, rather than to the nature of TNFRSF13B variants only.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

The intensity of regional specialization in specific activities, and conversely, the level of industrial concentration in specific locations, has been used as a complementary evidence for the existence and significance of externalities. Additionally, economists have mainly focused the debate on disentangling the sources of specialization and concentration processes according to three vectors: natural advantages, internal, and external scale economies. The arbitrariness of partitions plays a key role in capturing these effects, while the selection of the partition would have to reflect the actual characteristics of the economy. Thus, the identification of spatial boundaries to measure specialization becomes critical, since most likely the model will be adapted to different scales of distance, and be influenced by different types of externalities or economies of agglomeration, which are based on the mechanisms of interaction with particular requirements of spatial proximity. This work is based on the analysis of the spatial aspect of economic specialization supported by the manufacturing industry case. The main objective is to propose, for discrete and continuous space: i) a measure of global specialization; ii) a local disaggregation of the global measure; and iii) a spatial clustering method for the identification of specialized agglomerations.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

In a large number of problems the high dimensionality of the search space, the vast number of variables and the economical constrains limit the ability of classical techniques to reach the optimum of a function, known or unknown. In this thesis we investigate the possibility to combine approaches from advanced statistics and optimization algorithms in such a way to better explore the combinatorial search space and to increase the performance of the approaches. To this purpose we propose two methods: (i) Model Based Ant Colony Design and (ii) Naïve Bayes Ant Colony Optimization. We test the performance of the two proposed solutions on a simulation study and we apply the novel techniques on an appplication in the field of Enzyme Engineering and Design.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Two Amerindian populations from the Peruvian Amazon (Yanesha) and from rural lowlands of the Argentinean Gran Chaco (Wichi) were analyzed. They represent two case study of the South American genetic variability. The Yanesha represent a model of population isolated for long-time in the Amazon rainforest, characterized by environmental and altitudinal stratifications. The Wichi represent a model of population living in an area recently colonized by European populations (the Criollos are the population of the admixed descendents), whose aim is to depict the native ancestral gene pool and the degree of admixture, in relation to the very high prevalence of Chagas disease. The methods used for the genotyping are common, concerning the Y chromosome markers (male lineage) and the mitochondrial markers (maternal lineage). The determination of the phylogeographic diagnostic polymorphisms was carried out by the classical techniques of PCR, restriction enzymes, sequencing and specific mini-sequencing. New method for the detection of the protozoa Trypanosoma cruzi was developed by means of the nested PCR. The main results show patterns of genetic stratification in Yanesha forest communities, referable to different migrations at different times, estimated by Bayesian analyses. In particular Yanesha were considered as a population of transition between the Amazon basin and the Andean Cordillera, evaluating the potential migration routes and the separation of clusters of community in relation to different genetic bio-ancestry. As the Wichi, the gene pool analyzed appears clearly differentiated by the admixed sympatric Criollos, due to strict social practices (deeply analyzed with the support of cultural anthropological tools) that have preserved the native identity at a diachronic level. A pattern of distribution of the seropositivity in relation to the different phylogenetic lineages (the adaptation in evolutionary terms) does not appear, neither Amerindian nor European, but in relation to environmental and living conditions of the two distinct subpopulations.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

There are different ways to do cluster analysis of categorical data in the literature and the choice among them is strongly related to the aim of the researcher, if we do not take into account time and economical constraints. Main approaches for clustering are usually distinguished into model-based and distance-based methods: the former assume that objects belonging to the same class are similar in the sense that their observed values come from the same probability distribution, whose parameters are unknown and need to be estimated; the latter evaluate distances among objects by a defined dissimilarity measure and, basing on it, allocate units to the closest group. In clustering, one may be interested in the classification of similar objects into groups, and one may be interested in finding observations that come from the same true homogeneous distribution. But do both of these aims lead to the same clustering? And how good are clustering methods designed to fulfil one of these aims in terms of the other? In order to answer, two approaches, namely a latent class model (mixture of multinomial distributions) and a partition around medoids one, are evaluated and compared by Adjusted Rand Index, Average Silhouette Width and Pearson-Gamma indexes in a fairly wide simulation study. Simulation outcomes are plotted in bi-dimensional graphs via Multidimensional Scaling; size of points is proportional to the number of points that overlap and different colours are used according to the cluster membership.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.

Relevância:

20.00% 20.00%

Publicador:

Resumo:

Epigenetic variability is a new mechanism for the study of human microevolution, because it creates both phenotypic diversity within an individual and within population. This mechanism constitutes an important reservoir for adaptation in response to new stimuli and recent studies have demonstrated that selective pressures shape not only the genetic code but also DNA methylation profiles. The aim of this thesis is the study of the role of DNA methylation changes in human adaptive processes, considering the Italian peninsula and macro-geographical areas. A whole-genome analysis of DNA methylation profile across the Italian penisula identified some genes whose methylation levels differ between individuals of different Italian districts (South, Centre and North of Italy). These genes are involved in nitrogen compound metabolism and genes involved in pathogens response. Considering individuals with different macro-geographical origins (individuals of Asians, European and African ancestry) more significant DMRs (differentially methylated regions) were identified and are located in genes involved in glucoronidation, in immune response as well as in cell comunication processes. A "profile" of each ancestry (African, Asian and European) was described. Moreover a deepen analysis of three candidate genes (KRTCAP3, MAD1L and BRSK2) in a cohort of individuals of different countries (Morocco, Nigeria, China and Philippines) living in Bologna, was performed in order to explore genetic and epigenetic diversity. Moreover this thesis have paved the way for the application of DNA methylation for the study of hystorical remains and in particular for the age-estimation of individuals starting from biological samples (such as teeth or blood). Noteworthy, a mathematical model that considered methylation values of DNA extracted from cementum and pulp of living individuals can estimate chronological age with high accuracy (median absolute difference between age estimated from DNA methylation and chronological age was 1.2 years).