13 results for Similarity measure

in AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevance:

60.00%

Publisher:

Abstract:

In many application domains, data can be naturally represented as graphs. When the application of analytical solutions to a given problem is unfeasible, machine learning techniques can be a viable way to solve it. Classical machine learning techniques are defined for data represented in vectorial form. Recently, some of them have been extended to deal directly with structured data. Among those techniques, kernel methods have shown promising results from both the computational-complexity and the predictive-performance points of view. Kernel methods make it possible to avoid an explicit mapping to a vectorial form by relying on kernel functions, which, informally, are functions that compute a similarity measure between two entities. However, the definition of good kernels for graphs is a challenging problem because of the difficulty of finding a good tradeoff between computational complexity and expressiveness. Another problem we face is learning on data streams, where a potentially unbounded sequence of data is generated by some source. There are three main contributions in this thesis. The first contribution is the definition of a new family of kernels for graphs based on Directed Acyclic Graphs (DAGs). We analyzed two kernels from this family, achieving state-of-the-art results, from both the computational and the classification points of view, on real-world datasets. The second contribution consists in making the application of learning algorithms to streams of graphs feasible. Moreover, we defined a principled way to manage memory. The third contribution is the application of machine learning techniques for structured data to non-coding RNA function prediction. In this setting, the secondary structure is thought to carry relevant information. However, existing methods that consider the secondary structure have prohibitively high computational complexity. We propose to apply kernel methods to this domain, obtaining state-of-the-art results.
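To make the idea of a kernel over DAG-structured data concrete, the following is a minimal sketch of a convolution-style kernel that counts matching sub-DAG signatures. It only illustrates the general principle, not the kernels actually defined in the thesis; the DAG encoding (a dict of labelled nodes with child lists) and every name in it are assumptions introduced for the example.

```python
from collections import Counter

def node_signatures(dag):
    """Canonical signature for every node of a labelled DAG.

    `dag` maps node_id -> (label, [child_ids]).  A node's signature
    combines its label with the sorted signatures of its children, so two
    nodes share a signature exactly when they root identical sub-DAGs.
    """
    memo = {}

    def sig(n):
        if n not in memo:
            label, children = dag[n]
            memo[n] = label + "(" + ",".join(sorted(sig(c) for c in children)) + ")"
        return memo[n]

    return Counter(sig(n) for n in dag)

def dag_kernel(dag_a, dag_b):
    """Toy convolution-style kernel: count pairs of matching sub-DAG signatures."""
    ca, cb = node_signatures(dag_a), node_signatures(dag_b)
    return sum(ca[s] * cb[s] for s in ca if s in cb)

# Two small DAGs sharing the sub-structures rooted at 'c' and at 'b'.
g1 = {0: ("a", [1, 2]), 1: ("b", [2]), 2: ("c", [])}
g2 = {0: ("a", [1]), 1: ("b", [2]), 2: ("c", [])}
print(dag_kernel(g1, g2))  # -> 2
```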

Relevance:

20.00%

Publisher:

Abstract:

I have studied entropy profiles obtained for a sample of 24 X-ray objects at high redshift retrieved from the Chandra archive. I have discussed the scaling properties of the entropy S, the correlation between metallicity Z and S, and the profiles of the gas temperature, Tgas, and performed a comparison between the dark matter 'temperature' and Tgas in order to constrain the non-gravitational processes that affect the thermal history of the gas. Furthermore, I have studied the scaling relations between the X-ray quantities and Sunyaev-Zel'dovich measurements. I have observed that the X-ray scaling laws are steeper than the relations predicted by the adiabatic model. These deviations from the expectations based on self-similarity are usually interpreted in terms of feedback processes leading to non-gravitational gas heating, suggesting a scenario in which the ICM at higher redshift has both lower X-ray luminosity and lower pressure in the central regions than expected from the self-similar model. I have also investigated a joint Bayesian X-ray and Sunyaev-Zel'dovich analysis, which makes it possible to study the external regions of the clusters well beyond the volumes resolved by X-ray observations (1/3-1/2 of the virial radius) and to measure the deprojected physical cluster properties, such as temperature, density, entropy, gas mass and total mass, up to the virial radius.
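As a minimal numerical illustration of the quantity under study, the sketch below evaluates the ICM entropy in the convention commonly adopted in X-ray work, K = kT / n_e^(2/3), on an invented deprojected profile and fits its logarithmic slope; the numbers are placeholders, not results from the thesis.

```python
import numpy as np

# Hypothetical deprojected profiles for one cluster (not data from the thesis):
# radius in kpc, gas temperature in keV, electron density in cm^-3.
r_kpc = np.array([50.0, 100.0, 200.0, 400.0])
kT_keV = np.array([4.2, 4.8, 5.1, 4.9])
n_e = np.array([8e-3, 3e-3, 1e-3, 3e-4])

# X-ray "entropy" in the convention commonly used for the ICM: K = kT / n_e^(2/3).
K = kT_keV / n_e ** (2.0 / 3.0)  # keV cm^2

# Logarithmic slope of the resulting profile, assuming K ~ r^alpha.
alpha, _ = np.polyfit(np.log10(r_kpc), np.log10(K), 1)
print(K.round(1), round(alpha, 2))
```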

Relevance:

20.00%

Publisher:

Abstract:

In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has led to enormous progress in the life sciences. Among the most important innovations is microarray technology. It makes it possible to quantify the expression of thousands of genes simultaneously by measuring the hybridization of a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousands, and a sample size in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex diseases such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. These samples are hybridized to microarrays in an effort to find a small number of genes strongly correlated with the groups of individuals. Even though methods to analyse the data are today well developed and close to reaching a standard organization (through the efforts of international projects like the Microarray Gene Expression Data (MGED) Society [1]), it is not infrequent to stumble upon a clinician's question for which no compelling statistical method is available. The contribution of this dissertation to deciphering disease is the development of new approaches aimed at handling open problems posed by clinicians in specific experimental designs. In Chapter 1, starting from a necessary biological introduction, we review microarray technologies and all the important steps an experiment involves, from the production of the array, through quality control, to the preprocessing steps used in the data analysis in the rest of the dissertation. In Chapter 2 a critical review of standard analysis methods is provided, stressing most of the problems that remain open. In Chapter 3 a method is introduced to address the issue of unbalanced designs in microarray experiments. In microarray experiments, experimental design is a crucial starting point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples should be collected for the two classes. However, in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists in a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC). A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each probe is given a "score" ranging from 0 to 1,000 based on its recurrence as differentially expressed in the 1,000 lists. The performance of MultiSAM was compared to that of SAM and LIMMA [3] over two data sets simulated from beta and exponential distributions. The results of all three algorithms over low-noise data sets seem acceptable. However, on a real unbalanced two-channel data set regarding Chronic Lymphocytic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering.
We also report extra-assay validation in terms of differentially expressed genes. Although standard algorithms perform well over low-noise simulated data sets, MultiSAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4 a method to address similarity evaluation in a three-class problem by means of the Relevance Vector Machine [4] is described. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences can play a crucial role. In some cases similarities can give useful, and sometimes even more important, information. The goal, given three classes, could be to establish, with a certain level of confidence, whether the third one is similar to the first or to the second. In this work we show that the Relevance Vector Machine (RVM) [2] could be a possible solution to the limitations of standard supervised classification. In fact, RVM offers many advantages compared, for example, with its well-known precursor, the Support Vector Machine (SVM) [3]. Among these advantages, the estimate of the posterior probability of class membership represents a key feature for addressing the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on a three-class tumor-grade problem, with 67 samples of grade 1 (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, and then evaluate the third class, G2, as a test set to obtain the probability for each G2 sample of being a member of class G1 or class G3. The analysis showed that breast cancer samples of grade 2 have a molecular profile more similar to that of breast cancer samples of grade 1. This result had been conjectured in the literature, but no measure of significance had been given before.
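The MultiSAM scheme described above lends itself to a short sketch. The code below reproduces only the resampling-and-scoring loop; a plain two-sample t-test stands in for the SAM statistic, and the data, threshold and all names are invented for illustration.

```python
import numpy as np
from scipy import stats

def multisam_like_scores(lpc, mpc, n_iter=1000, alpha=0.05, seed=None):
    """Sketch of the MultiSAM resampling scheme with a placeholder test.

    lpc, mpc: genes x samples expression matrices for the less and more
    populated class.  At each iteration a random subset of MPC columns of
    the same size as LPC is drawn and a per-gene test is run; a gene's
    score counts in how many iterations it is flagged as differentially
    expressed (0..n_iter).  A plain t-test stands in for SAM here, and
    the significance threshold is illustrative.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_lpc = lpc.shape
    scores = np.zeros(n_genes, dtype=int)
    for _ in range(n_iter):
        cols = rng.choice(mpc.shape[1], size=n_lpc, replace=False)
        _, p = stats.ttest_ind(lpc, mpc[:, cols], axis=1)
        scores += (p < alpha).astype(int)
    return scores

# Hypothetical usage: 500 genes, 5 samples vs 40 samples of pure noise.
rng = np.random.default_rng(0)
scores = multisam_like_scores(rng.normal(size=(500, 5)),
                              rng.normal(size=(500, 40)), n_iter=100)
print(scores.max())
```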

Relevance:

20.00%

Publisher:

Abstract:

The vast majority of known proteins have not yet been experimentally characterized, and little is known about their function. The design and implementation of computational tools can provide insight into the function of proteins based on their sequence, their structure, their evolutionary history and their association with other proteins. Knowledge of the three-dimensional (3D) structure of a protein can lead to a deep understanding of its mode of action and interaction, but currently the structures of <1% of sequences have been experimentally solved. For this reason, it has become urgent to develop new methods that are able to computationally extract relevant information from protein sequence and structure. The starting point of my work has been the study of the properties of contacts between protein residues, since they constrain protein folding and characterize different protein structures. Prediction of residue contacts in proteins is an interesting problem whose solution may be useful in protein fold recognition and de novo design. The prediction of these contacts requires the study of the inter-residue distances, related to the specific type of amino acid pair, that are encoded in the so-called contact map. An interesting new way of analyzing those structures emerged when network studies were introduced, with pivotal papers demonstrating that protein contact networks also exhibit small-world behavior. In order to highlight constraints for the prediction of protein contact maps, and for applications in the field of protein structure prediction and/or reconstruction from experimentally determined contact maps, I studied to what extent the characteristic path length and the clustering coefficient of the protein contact network reveal characteristic features of protein contact maps. Provided that residue contacts are known for a protein sequence, the major features of its 3D structure could be deduced by combining this knowledge with correctly predicted motifs of secondary structure. In the second part of my work I focused on a particular protein structural motif, the coiled-coil, known to mediate a variety of fundamental biological interactions. Coiled-coils are found in a variety of structural forms and in a wide range of proteins including, for example, small units such as leucine zippers, which drive the dimerization of many transcription factors, or more complex structures such as the family of viral proteins responsible for virus-host membrane fusion. The coiled-coil structural motif is estimated to account for 5-10% of the protein sequences in the various genomes. Given their biological importance, in my work I introduced a Hidden Markov Model (HMM) that exploits the evolutionary information derived from multiple sequence alignments to predict coiled-coil regions and to discriminate coiled-coil sequences. The results indicate that the new HMM outperforms all the existing programs and can be adopted for coiled-coil prediction and for large-scale genome annotation. Genome annotation is a key issue in modern computational biology, being the starting point towards the understanding of the complex processes involved in biological networks. The rapid growth in the number of protein sequences and structures available poses new fundamental problems that still await interpretation. Nevertheless, these data are at the basis of the design of new strategies for tackling problems such as the prediction of protein structure and function.
Experimental determination of the functions of all these proteins would be a hugely time-consuming and costly task and, in most instances, has not been carried out. Currently, for example, only approximately 20% of the annotated proteins in the Homo sapiens genome have been experimentally characterized. A commonly adopted procedure for annotating protein sequences relies on "inheritance through homology", based on the notion that similar sequences share similar functions and structures. This procedure consists in the assignment of sequences to a specific group of functionally related sequences that has been built through clustering techniques. The clustering procedure is based on suitable similarity rules, since predicting protein structure and function from sequence largely depends on the value of sequence identity. However, additional levels of complexity are due to multi-domain proteins, to proteins that share common domains but do not necessarily share the same function, and to the finding that different combinations of shared domains can lead to different biological roles. In the last part of this study I developed and validated a system that contributes to sequence annotation by taking advantage of a validated transfer-through-inheritance procedure for molecular functions and structural templates. After a cross-genome comparison with the BLAST program, clusters were built on the basis of two stringent constraints on sequence identity and on the coverage of the alignment. The adopted measure explicitly addresses the problem of annotating multi-domain proteins and allows a fine-grained division of the whole set of proteomes used, which ensures cluster homogeneity in terms of sequence length. A high level of coverage of structural templates over the length of the protein sequences within clusters ensures that multi-domain proteins, when present, can be templates for sequences of similar length. This annotation procedure includes the possibility of reliably transferring statistically validated functions and structures to sequences, considering the information available in the present databases of molecular functions and structures.
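The clustering step based on identity and coverage constraints can be sketched as a single-linkage grouping over pairwise alignment hits. The code below is an illustrative simplification: the thresholds, the input format and the single-linkage rule are assumptions for the example, not the exact procedure validated in the thesis.

```python
from collections import defaultdict

def cluster_by_identity_and_coverage(hits, min_identity=40.0, min_coverage=0.9):
    """Sketch of homology-based clustering from pairwise alignment hits.

    `hits` is an iterable of tuples
        (query_id, subject_id, percent_identity, coverage_fraction),
    e.g. parsed from BLAST tabular output.  Two sequences are linked when
    both thresholds are met; clusters are the connected components
    (single linkage).  Thresholds here are illustrative placeholders.
    """
    adj = defaultdict(set)
    for q, s, ident, cov in hits:
        if q != s and ident >= min_identity and cov >= min_coverage:
            adj[q].add(s)
            adj[s].add(q)

    seen, clusters = set(), []
    for seed in adj:
        if seed in seen:
            continue
        stack, comp = [seed], set()
        while stack:
            node = stack.pop()
            if node not in comp:
                comp.add(node)
                stack.extend(adj[node] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters

# Hypothetical usage:
hits = [("P1", "P2", 62.0, 0.95), ("P2", "P3", 55.0, 0.97), ("P4", "P5", 30.0, 0.5)]
print(cluster_by_identity_and_coverage(hits))  # one cluster containing P1, P2, P3
```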

Relevance:

20.00%

Publisher:

Abstract:

Machine learning comprises a series of techniques for the automatic extraction of meaningful information from large collections of noisy data. In many real-world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori preprocessing step. Among the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree-structured data two issues become relevant: kernels for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too little information to make correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving like the nearest-neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernels, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required in both the learning and the classification phases. Such complexity can sometimes prevent the application of the kernel in scenarios involving large amounts of data. This thesis proposes three contributions for resolving the above issues of kernels for trees. A first contribution aims at creating kernel functions that adapt to the statistical properties of the dataset, thus reducing sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower-dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower-dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. A convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted to reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Directed Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.
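To illustrate the last point, the sketch below counts identical subtrees and merges the subtrees shared across a forest into a single table of signatures with multiplicities, which is the intuition behind using DAGs to share substructures. It is a simplification: the actual subtree and subset-tree kernels typically include a decay factor and, in the subset-tree case, partial production matches, which are omitted here, and all names and the tree encoding are invented for the example.

```python
from collections import Counter

def tree_signatures(tree):
    """Signatures of all subtrees of an ordered, labelled tree.

    A tree is a nested tuple (label, [children]).  Two subtrees share a
    signature iff they are identical, which is what this toy kernel counts.
    """
    counts = Counter()

    def sig(node):
        label, children = node
        s = label + "(" + ",".join(sig(c) for c in children) + ")"
        counts[s] += 1
        return s

    sig(tree)
    return counts

def forest_dag_counts(forest):
    """Merge identical subtrees of a whole forest into one table of counts.

    This mirrors the idea of representing shared substructures of many
    trees with a single DAG: each distinct subtree is stored once,
    together with its multiplicity across the forest.
    """
    total = Counter()
    for t in forest:
        total.update(tree_signatures(t))
    return total

def subtree_kernel_vs_forest(tree, forest_counts):
    """Sum of toy subtree-kernel values between `tree` and every tree of the forest."""
    ct = tree_signatures(tree)
    return sum(n * forest_counts[s] for s, n in ct.items() if s in forest_counts)

# Hypothetical usage with tiny ordered trees.
t1 = ("a", [("b", []), ("c", [])])
t2 = ("a", [("b", [])])
print(subtree_kernel_vs_forest(("b", []), forest_dag_counts([t1, t2])))  # -> 2
```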

Relevance:

20.00%

Publisher:

Abstract:

The intensity of regional specialization in specific activities, and conversely the level of industrial concentration in specific locations, has been used as complementary evidence for the existence and significance of externalities. Additionally, economists have mainly focused the debate on disentangling the sources of specialization and concentration processes according to three vectors: natural advantages, internal scale economies, and external scale economies. The arbitrariness of spatial partitions plays a key role in capturing these effects, while the selection of the partition should reflect the actual characteristics of the economy. Thus, the identification of spatial boundaries for measuring specialization becomes critical, since the model will most likely have to be adapted to different scales of distance and will be influenced by different types of externalities or agglomeration economies, which rest on interaction mechanisms with particular requirements of spatial proximity. This work analyses the spatial dimension of economic specialization, using the manufacturing industry as a case study. The main objective is to propose, for discrete and continuous space: i) a measure of global specialization; ii) a local disaggregation of the global measure; and iii) a spatial clustering method for the identification of specialized agglomerations.
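Since the abstract centres on measuring specialization, here is a minimal discrete-space illustration using a Krugman-style index, with a region-level (local) value and a size-weighted global aggregate. This is a standard textbook measure shown only to fix ideas; it is not claimed to be the measure proposed in the thesis, and the data are placeholders.

```python
import numpy as np

def specialization_indices(emp):
    """Krugman-style specialization indices for a regions x industries matrix.

    `emp` holds employment counts.  For each region the index is
    sum_i |share of industry i in the region - share of industry i overall|,
    i.e. 0 for a region mirroring the aggregate industry mix and up to 2
    for a fully specialized one.  The global value is the employment-
    weighted average of the local indices.
    """
    emp = np.asarray(emp, dtype=float)
    region_shares = emp / emp.sum(axis=1, keepdims=True)       # industry mix per region
    overall_shares = emp.sum(axis=0) / emp.sum()                # aggregate industry mix
    local = np.abs(region_shares - overall_shares).sum(axis=1)  # one value per region
    weights = emp.sum(axis=1) / emp.sum()                       # regions weighted by size
    return local, float(np.dot(weights, local))

# Hypothetical usage: 3 regions, 2 industries.
local, global_index = specialization_indices([[90, 10], [50, 50], [20, 80]])
print(local.round(3), round(global_index, 3))
```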

Relevance:

20.00%

Publisher:

Abstract:

The aim of the thesis is to formulate a suitable Item Response Theory (IRT) based model to measure HRQoL (as a latent variable) using a mixed-response questionnaire and relaxing the hypothesis of a normally distributed latent variable. The new model is a combination of two models already presented in the literature, namely a latent trait model for mixed responses and an IRT model for a skew-normal latent variable. It is developed in a Bayesian framework, and a Markov chain Monte Carlo procedure is used to generate samples from the posterior distribution of the parameters of interest. The proposed model is tested on a questionnaire composed of 5 discrete items and one continuous item used to measure HRQoL in children, the EQ-5D-Y questionnaire. A large sample of children collected in schools was used. In comparison with a model for only discrete responses and a model for mixed responses and a normal latent variable, the new model has better performance in terms of deviance information criterion (DIC), chain convergence times and precision of the estimates.
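As a rough sketch of the ingredients of such a model, the code below writes the unnormalized log-posterior of one respondent's latent trait, combining binary 2PL items, one continuous item modelled as a normal regression on the trait, and a skew-normal prior in place of the usual normal one. It is only a caricature of the model in the thesis: the EQ-5D-Y items are actually ordinal rather than binary, all item parameters are invented, and no MCMC machinery is shown.

```python
import numpy as np
from scipy import stats

def twopl_prob(theta, a, b):
    """2PL IRT probability of endorsing a binary item given latent trait theta."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def log_posterior(theta, y_disc, a, b, y_cont, slope_c, sigma_c, skew):
    """Unnormalized log-posterior of one respondent's latent HRQoL score.

    Discrete items follow a 2PL model, the continuous item a normal
    regression on theta, and the prior on theta is skew normal rather
    than normal.  All parameter values are illustrative placeholders.
    """
    p = twopl_prob(theta, a, b)
    ll_disc = np.sum(y_disc * np.log(p) + (1 - y_disc) * np.log(1 - p))
    ll_cont = stats.norm.logpdf(y_cont, loc=slope_c * theta, scale=sigma_c)
    prior = stats.skewnorm.logpdf(theta, a=skew)
    return ll_disc + ll_cont + prior

# Hypothetical usage: 5 binary items plus one continuous score, grid over theta.
theta_grid = np.linspace(-3, 3, 121)
lp = [log_posterior(t, np.array([1, 1, 0, 1, 0]), np.ones(5), np.zeros(5),
                    y_cont=0.4, slope_c=0.5, sigma_c=0.3, skew=4.0)
      for t in theta_grid]
print(theta_grid[int(np.argmax(lp))])  # posterior mode on the grid
```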

Relevance:

20.00%

Publisher:

Abstract:

The thesis deals with the translation of Shakespeare's Measure for Measure made by Cesare Garboli and published in 1992 by Einaudi in the series «Scrittori tradotti da scrittori». The translation was conceived for the Teatro Stabile di Torino directed by Luca Ronconi; it premiered at the Teatro Carignano in 1992 and was later revived, with some variants, by Carlo Cecchi's company in 1998 for a new staging at the Teatro Garibaldi in Palermo. Starting from the most recent developments in Translation Studies, the work develops a comparative study, from a linguistic point of view and from a hermeneutic perspective, between Garboli's translation, the original text in the Arden and Cambridge editions, and the Italian translations of Measure for Measure published in the twentieth century. The final part of the thesis is devoted to the stagings in Turin and Palermo: a comparison aimed at highlighting the elements that, in both, belong to the structuring of the translated text, and the specific characters of the fictional universes depicted by the two directors.

Relevance:

20.00%

Publisher:

Abstract:

The importance of organizational issues in assessing the success of international development projects has not yet been fully considered. After a brief overview, in the 1st chapter, of the main actors involved in international cooperation, the 2nd chapter presents an analysis of the literature on the definition of project success, focused on success criteria and success factors, carried out by surveying the contributions of different authors and approaches. Traditionally, projects were perceived as successful when they met time, budget and performance goals, assuming a basic similarity among projects (universalistic approach). However, starting from a non-universalistic approach, the importance of organizational effectiveness, in terms of Relations Sustainability, emerged as a dimension able to define and assess project success. The identification of the factors influencing the relationships between and inside organizations consequently becomes a priority. In the 3rd chapter, starting from a literature survey, the different analytical approaches related to inter- and intra-organizational relationships are analysed. They fall into two different groups: the first includes studies focused on the type of organizational relationship structure (Supply Chains, Networks, Clusters and Industrial Districts); the second includes approaches related to the general theories interpreting firm relationships (Transaction Cost Economics, Resource-Based View, Organization Theory). The variables and logical frameworks provided by these different theoretical contributions are compared and classified in order to find possible connections and/or juxtapositions. Since an exhaustive collection of the literature on the subject is impossible, the main goal is to underline the existence of potentially overlapping and/or complementary approaches by examining the contributions of different representative authors. The survey showed, first of all, many variables in common between approaches coming from different disciplines; furthermore, the non-overlapping variables can be integrated, contributing to a broader picture of the variables influencing organizational relations; in particular, a theoretical design for the identification of connections between inter- and intra-organizational relations was made possible. The results obtained in the 3rd chapter help to define a general theoretical framework linking the different interpretative variables. Based on extensive research contributions on the factors influencing the relations between organizations, the 4th chapter expands the analysis of the influence of variables like Human Resource Management, Organizational Climate, Psychological Contract and KSA (Knowledge, Skills, Abilities) on Relations Sustainability. A detailed analysis of these relations is provided and research hypotheses are built. According to this new framework, in the 5th chapter a statistical analysis was performed to qualify and quantify the influence of Organizational Climate on Relations Sustainability. To this end, Structural Equation Modeling (SEM) was adopted as the method for defining the latent variables and measuring their relations. The results obtained are satisfactory.
Finding an effective strategy to motivate respondents to participate in the survey seems at the moment to be one of the major obstacles to implementing the analysis, since organizational performance is not specifically required by project evaluation guidelines and its assessment represents an increase in project-related transaction costs. Its explicit introduction into project presentation guidelines should be explored as an opportunity to increase the chances of success of these projects.

Relevance:

20.00%

Publisher:

Abstract:

The purpose of this thesis is to investigate the strength and structure of the magnetized medium surrounding radio galaxies via observations of the Faraday effect. This study is based on an analysis of the polarization properties of radio galaxies selected to have a range of morphologies (elongated tails, or lobes with small axial ratios) and to be located in a variety of environments (from rich cluster cores to small groups). The targets include famous objects like M84 and M87. A key aspect of this work is the combination of accurate radio imaging with high-quality X-ray data for the gas surrounding the sources. Although the focus of this thesis is primarily observational, I developed analytical models and performed two- and three-dimensional numerical simulations of magnetic fields. The steps of the thesis are: (a) to analyze new and archival observations of the Faraday rotation measure (RM) across radio galaxies and (b) to interpret these and existing RM images using sophisticated two- and three-dimensional Monte Carlo simulations. The approach has been to select a few bright, very extended and highly polarized radio galaxies. This is essential to have high signal-to-noise in polarization over large enough areas to allow the computation of spatial statistics such as the structure function (and hence the power spectrum) of the rotation measure, which requires a large number of independent measurements. New and archival Very Large Array observations of the target sources have been analyzed in combination with high-quality X-ray data from the Chandra, XMM-Newton and ROSAT satellites. The work has been carried out by making use of: 1) analytical predictions of the RM structure functions, to quantify the RM statistics and to constrain the power spectra of the RM and of the magnetic field; 2) two-dimensional Monte Carlo simulations, to address the effect of incomplete sampling of the RM distribution and so determine errors for the power spectra; 3) methods to combine measurements of RM and depolarization in order to constrain the magnetic-field power spectrum on small scales; 4) three-dimensional models of the group/cluster environments, including different magnetic-field power spectra and gas density distributions. This thesis has shown that the magnetized medium surrounding radio galaxies is more complicated than was apparent from earlier work. Three distinct types of magnetic-field structure are identified: an isotropic component with large-scale fluctuations, plausibly associated with the intergalactic medium not affected by the presence of a radio source; a well-ordered field draped around the front ends of the radio lobes; and a field with small-scale fluctuations in rims of compressed gas surrounding the inner lobes, perhaps associated with a mixing layer.
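The RM statistics mentioned above hinge on the second-order structure function, S(d) = <[RM(x) - RM(x+d)]^2>. The following is a small brute-force sketch of its computation from a pixelized RM image; the binning, the synthetic input and every name are assumptions for illustration, not the analysis pipeline of the thesis.

```python
import numpy as np

def rm_structure_function(rm_map, n_bins=20):
    """Second-order structure function of a rotation-measure image.

    S(d) = < [RM(x) - RM(x + d)]^2 >, averaged over all pixel pairs whose
    separation falls in a given distance bin; NaN (blanked) pixels are
    ignored.  Brute force over pixel pairs, so only meant for small
    cut-outs.
    """
    ys, xs = np.where(np.isfinite(rm_map))
    vals = rm_map[ys, xs]
    dist = np.hypot(xs[:, None] - xs[None, :], ys[:, None] - ys[None, :])
    dsq = (vals[:, None] - vals[None, :]) ** 2
    iu = np.triu_indices_from(dist, k=1)          # each pair counted once
    bins = np.linspace(0.0, dist[iu].max(), n_bins + 1)
    which = np.digitize(dist[iu], bins) - 1
    sf = np.array([dsq[iu][which == b].mean() if np.any(which == b) else np.nan
                   for b in range(n_bins)])
    return 0.5 * (bins[:-1] + bins[1:]), sf       # bin centres (pixels), S(d)

# Hypothetical usage on a small synthetic RM patch (values in rad m^-2).
rng = np.random.default_rng(1)
separations, sf = rm_structure_function(rng.normal(0.0, 30.0, size=(30, 30)))
print(sf[:3])
```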

Relevance:

20.00%

Publisher:

Abstract:

The formation and evolution of galaxy bulges is a greatly debated topic in modern astrophysics. One approach to this issue is to look at the Galactic bulge, the one closest to us. According to some theoretical models, our bulge built up from the merger of substructures formed through the instability and fragmentation of a proto-disk in the early phases of Galactic evolution. We may have discovered the remnant of one of these substructures: the stellar system Terzan 5. Terzan 5 hosts two stellar populations with different iron abundances, suggesting that it was once far more massive than it is today. Moreover, its peculiar chemistry resembles that observed only in the Galactic bulge. In this Thesis we perform a detailed photometric and spectroscopic analysis of this cluster to determine its formation and evolutionary history. From the photometric point of view, we built a high-resolution differential reddening map in the direction of Terzan 5 and measured relative proper motions to separate its member population from the contaminating field stars. This information is the necessary groundwork for measuring the absolute ages of the Terzan 5 populations via the turn-off luminosity method. From the spectroscopic point of view, we measured abundances for more than 600 stars belonging to Terzan 5 and its surroundings in order to build the largest field-decontaminated metallicity distribution for this system. We find that the metallicity distribution is extremely wide (more than 1 dex), and we discovered a third, metal-poor and alpha-enhanced population with average [Fe/H] = -0.8. The striking similarity between Terzan 5 and the bulge in terms of their chemical formation and evolution revealed by this Thesis suggests that Terzan 5 formed in situ together with the bulge itself. In particular, its metal-poor populations trace the early stages of the bulge formation, while its most metal-rich component contains crucial information on the bulge's more recent evolution.