12 resultados para Robust Statistics
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Resumo:
In the post genomic era with the massive production of biological data the understanding of factors affecting protein stability is one of the most important and challenging tasks for highlighting the role of mutations in relation to human maladies. The problem is at the basis of what is referred to as molecular medicine with the underlying idea that pathologies can be detailed at a molecular level. To this purpose scientific efforts focus on characterising mutations that hamper protein functions and by these affect biological processes at the basis of cell physiology. New techniques have been developed with the aim of detailing single nucleotide polymorphisms (SNPs) at large in all the human chromosomes and by this information in specific databases are exponentially increasing. Eventually mutations that can be found at the DNA level, when occurring in transcribed regions may then lead to mutated proteins and this can be a serious medical problem, largely affecting the phenotype. Bioinformatics tools are urgently needed to cope with the flood of genomic data stored in database and in order to analyse the role of SNPs at the protein level. In principle several experimental and theoretical observations are suggesting that protein stability in the solvent-protein space is responsible of the correct protein functioning. Then mutations that are found disease related during DNA analysis are often assumed to perturb protein stability as well. However so far no extensive analysis at the proteome level has investigated whether this is the case. Also computationally methods have been developed to infer whether a mutation is disease related and independently whether it affects protein stability. Therefore whether the perturbation of protein stability is related to what it is routinely referred to as a disease is still a big question mark. In this work we have tried for the first time to explore the relation among mutations at the protein level and their relevance to diseases with a large-scale computational study of the data from different databases. To this aim in the first part of the thesis for each mutation type we have derived two probabilistic indices (for 141 out of 150 possible SNPs): the perturbing index (Pp), which indicates the probability that a given mutation effects protein stability considering all the “in vitro” thermodynamic data available and the disease index (Pd), which indicates the probability of a mutation to be disease related, given all the mutations that have been clinically associated so far. We find with a robust statistics that the two indexes correlate with the exception of all the mutations that are somatic cancer related. By this each mutation of the 150 can be coded by two values that allow a direct comparison with data base information. Furthermore we also implement computational methods that starting from the protein structure is suited to predict the effect of a mutation on protein stability and find that overpasses a set of other predictors performing the same task. The predictor is based on support vector machines and takes as input protein tertiary structures. We show that the predicted data well correlate with the data from the databases. All our efforts therefore add to the SNP annotation process and more importantly found the relationship among protein stability perturbation and the human variome leading to the diseasome.
Resumo:
With the advent of new technologies it is increasingly easier to find data of different nature from even more accurate sensors that measure the most disparate physical quantities and with different methodologies. The collection of data thus becomes progressively important and takes the form of archiving, cataloging and online and offline consultation of information. Over time, the amount of data collected can become so relevant that it contains information that cannot be easily explored manually or with basic statistical techniques. The use of Big Data therefore becomes the object of more advanced investigation techniques, such as Machine Learning and Deep Learning. In this work some applications in the world of precision zootechnics and heat stress accused by dairy cows are described. Experimental Italian and German stables were involved for the training and testing of the Random Forest algorithm, obtaining a prediction of milk production depending on the microclimatic conditions of the previous days with satisfactory accuracy. Furthermore, in order to identify an objective method for identifying production drops, compared to the Wood model, typically used as an analytical model of the lactation curve, a Robust Statistics technique was used. Its application on some sample lactations and the results obtained allow us to be confident about the use of this method in the future.
Resumo:
This thesis deals with an investigation of combinatorial and robust optimisation models to solve railway problems. Railway applications represent a challenging area for operations research. In fact, most problems in this context can be modelled as combinatorial optimisation problems, in which the number of feasible solutions is finite. Yet, despite the astonishing success in the field of combinatorial optimisation, the current state of algorithmic research faces severe difficulties with highly-complex and data-intensive applications such as those dealing with optimisation issues in large-scale transportation networks. One of the main issues concerns imperfect information. The idea of Robust Optimisation, as a way to represent and handle mathematically systems with not precisely known data, dates back to 1970s. Unfortunately, none of those techniques proved to be successfully applicable in one of the most complex and largest in scale (transportation) settings: that of railway systems. Railway optimisation deals with planning and scheduling problems over several time horizons. Disturbances are inevitable and severely affect the planning process. Here we focus on two compelling aspects of planning: robust planning and online (real-time) planning.
Resumo:
We present a non linear technique to invert strong motion records with the aim of obtaining the final slip and rupture velocity distributions on the fault plane. In this thesis, the ground motion simulation is obtained evaluating the representation integral in the frequency. The Green’s tractions are computed using the discrete wave-number integration technique that provides the full wave-field in a 1D layered propagation medium. The representation integral is computed through a finite elements technique, based on a Delaunay’s triangulation on the fault plane. The rupture velocity is defined on a coarser regular grid and rupture times are computed by integration of the eikonal equation. For the inversion, the slip distribution is parameterized by 2D overlapping Gaussian functions, which can easily relate the spectrum of the possible solutions with the minimum resolvable wavelength, related to source-station distribution and data processing. The inverse problem is solved by a two-step procedure aimed at separating the computation of the rupture velocity from the evaluation of the slip distribution, the latter being a linear problem, when the rupture velocity is fixed. The non-linear step is solved by optimization of an L2 misfit function between synthetic and real seismograms, and solution is searched by the use of the Neighbourhood Algorithm. The conjugate gradient method is used to solve the linear step instead. The developed methodology has been applied to the M7.2, Iwate Nairiku Miyagi, Japan, earthquake. The estimated magnitude seismic moment is 2.6326 dyne∙cm that corresponds to a moment magnitude MW 6.9 while the mean the rupture velocity is 2.0 km/s. A large slip patch extends from the hypocenter to the southern shallow part of the fault plane. A second relatively large slip patch is found in the northern shallow part. Finally, we gave a quantitative estimation of errors associates with the parameters.
Resumo:
3D video-fluoroscopy is an accurate but cumbersome technique to estimate natural or prosthetic human joint kinematics. This dissertation proposes innovative methodologies to improve the 3D fluoroscopic analysis reliability and usability. Being based on direct radiographic imaging of the joint, and avoiding soft tissue artefact that limits the accuracy of skin marker based techniques, the fluoroscopic analysis has a potential accuracy of the order of mm/deg or better. It can provide fundamental informations for clinical and methodological applications, but, notwithstanding the number of methodological protocols proposed in the literature, time consuming user interaction is exploited to obtain consistent results. The user-dependency prevented a reliable quantification of the actual accuracy and precision of the methods, and, consequently, slowed down the translation to the clinical practice. The objective of the present work was to speed up this process introducing methodological improvements in the analysis. In the thesis, the fluoroscopic analysis was characterized in depth, in order to evaluate its pros and cons, and to provide reliable solutions to overcome its limitations. To this aim, an analytical approach was followed. The major sources of error were isolated with in-silico preliminary studies as: (a) geometric distortion and calibration errors, (b) 2D images and 3D models resolutions, (c) incorrect contour extraction, (d) bone model symmetries, (e) optimization algorithm limitations, (f) user errors. The effect of each criticality was quantified, and verified with an in-vivo preliminary study on the elbow joint. The dominant source of error was identified in the limited extent of the convergence domain for the local optimization algorithms, which forced the user to manually specify the starting pose for the estimating process. To solve this problem, two different approaches were followed: to increase the optimal pose convergence basin, the local approach used sequential alignments of the 6 degrees of freedom in order of sensitivity, or a geometrical feature-based estimation of the initial conditions for the optimization; the global approach used an unsupervised memetic algorithm to optimally explore the search domain. The performances of the technique were evaluated with a series of in-silico studies and validated in-vitro with a phantom based comparison with a radiostereometric gold-standard. The accuracy of the method is joint-dependent, and for the intact knee joint, the new unsupervised algorithm guaranteed a maximum error lower than 0.5 mm for in-plane translations, 10 mm for out-of-plane translation, and of 3 deg for rotations in a mono-planar setup; and lower than 0.5 mm for translations and 1 deg for rotations in a bi-planar setups. The bi-planar setup is best suited when accurate results are needed, such as for methodological research studies. The mono-planar analysis may be enough for clinical application when the analysis time and cost may be an issue. A further reduction of the user interaction was obtained for prosthetic joints kinematics. A mixed region-growing and level-set segmentation method was proposed and halved the analysis time, delegating the computational burden to the machine. In-silico and in-vivo studies demonstrated that the reliability of the new semiautomatic method was comparable to a user defined manual gold-standard. The improved fluoroscopic analysis was finally applied to a first in-vivo methodological study on the foot kinematics. Preliminary evaluations showed that the presented methodology represents a feasible gold-standard for the validation of skin marker based foot kinematics protocols.
Resumo:
The main scope of my PhD is the reconstruction of the large-scale bivalve phylogeny on the basis of four mitochondrial genes, with samples taken from all major groups of the class. To my knowledge, it is the first attempt of such a breadth in Bivalvia. I decided to focus on both ribosomal and protein coding DNA sequences (two ribosomal encoding genes -12s and 16s -, and two protein coding ones - cytochrome c oxidase I and cytochrome b), since either bibliography and my preliminary results confirmed the importance of combined gene signals in improving evolutionary pathways of the group. Moreover, I wanted to propose a methodological pipeline that proved to be useful to obtain robust results in bivalves phylogeny. Actually, best-performing taxon sampling and alignment strategies were tested, and several data partitioning and molecular evolution models were analyzed, thus demonstrating the importance of molding and implementing non-trivial evolutionary models. In the line of a more rigorous approach to data analysis, I also proposed a new method to assess taxon sampling, by developing Clarke and Warwick statistics: taxon sampling is a major concern in phylogenetic studies, and incomplete, biased, or improper taxon assemblies can lead to misleading results in reconstructing evolutionary trees. Theoretical methods are already available to optimize taxon choice in phylogenetic analyses, but most involve some knowledge about genetic relationships of the group of interest, or even a well-established phylogeny itself; these data are not always available in general phylogenetic applications. The method I proposed measures the "phylogenetic representativeness" of a given sample or set of samples and it is based entirely on the pre-existing available taxonomy of the ingroup, which is commonly known to investigators. Moreover, it also accounts for instability and discordance in taxonomies. A Python-based script suite, called PhyRe, has been developed to implement all analyses.
Resumo:
Throughout the twentieth century statistical methods have increasingly become part of experimental research. In particular, statistics has made quantification processes meaningful in the soft sciences, which had traditionally relied on activities such as collecting and describing diversity rather than timing variation. The thesis explores this change in relation to agriculture and biology, focusing on analysis of variance and experimental design, the statistical methods developed by the mathematician and geneticist Ronald Aylmer Fisher during the 1920s. The role that Fisher’s methods acquired as tools of scientific research, side by side with the laboratory equipment and the field practices adopted by research workers, is here investigated bottom-up, beginning with the computing instruments and the information technologies that were the tools of the trade for statisticians. Four case studies show under several perspectives the interaction of statistics, computing and information technologies, giving on the one hand an overview of the main tools – mechanical calculators, statistical tables, punched and index cards, standardised forms, digital computers – adopted in the period, and on the other pointing out how these tools complemented each other and were instrumental for the development and dissemination of analysis of variance and experimental design. The period considered is the half-century from the early 1920s to the late 1960s, the institutions investigated are Rothamsted Experimental Station and the Galton Laboratory, and the statisticians examined are Ronald Fisher and Frank Yates.
Resumo:
It is usual to hear a strange short sentence: «Random is better than...». Why is randomness a good solution to a certain engineering problem? There are many possible answers, and all of them are related to the considered topic. In this thesis I will discuss about two crucial topics that take advantage by randomizing some waveforms involved in signals manipulations. In particular, advantages are guaranteed by shaping the second order statistic of antipodal sequences involved in an intermediate signal processing stages. The first topic is in the area of analog-to-digital conversion, and it is named Compressive Sensing (CS). CS is a novel paradigm in signal processing that tries to merge signal acquisition and compression at the same time. Consequently it allows to direct acquire a signal in a compressed form. In this thesis, after an ample description of the CS methodology and its related architectures, I will present a new approach that tries to achieve high compression by design the second order statistics of a set of additional waveforms involved in the signal acquisition/compression stage. The second topic addressed in this thesis is in the area of communication system, in particular I focused the attention on ultra-wideband (UWB) systems. An option to produce and decode UWB signals is direct-sequence spreading with multiple access based on code division (DS-CDMA). Focusing on this methodology, I will address the coexistence of a DS-CDMA system with a narrowband interferer. To do so, I minimize the joint effect of both multiple access (MAI) and narrowband (NBI) interference on a simple matched filter receiver. I will show that, when spreading sequence statistical properties are suitably designed, performance improvements are possible with respect to a system exploiting chaos-based sequences minimizing MAI only.
Resumo:
Bioinformatics, in the last few decades, has played a fundamental role to give sense to the huge amount of data produced. Obtained the complete sequence of a genome, the major problem of knowing as much as possible of its coding regions, is crucial. Protein sequence annotation is challenging and, due to the size of the problem, only computational approaches can provide a feasible solution. As it has been recently pointed out by the Critical Assessment of Function Annotations (CAFA), most accurate methods are those based on the transfer-by-homology approach and the most incisive contribution is given by cross-genome comparisons. In the present thesis it is described a non-hierarchical sequence clustering method for protein automatic large-scale annotation, called “The Bologna Annotation Resource Plus” (BAR+). The method is based on an all-against-all alignment of more than 13 millions protein sequences characterized by a very stringent metric. BAR+ can safely transfer functional features (Gene Ontology and Pfam terms) inside clusters by means of a statistical validation, even in the case of multi-domain proteins. Within BAR+ clusters it is also possible to transfer the three dimensional structure (when a template is available). This is possible by the way of cluster-specific HMM profiles that can be used to calculate reliable template-to-target alignments even in the case of distantly related proteins (sequence identity < 30%). Other BAR+ based applications have been developed during my doctorate including the prediction of Magnesium binding sites in human proteins, the ABC transporters superfamily classification and the functional prediction (GO terms) of the CAFA targets. Remarkably, in the CAFA assessment, BAR+ placed among the ten most accurate methods. At present, as a web server for the functional and structural protein sequence annotation, BAR+ is freely available at http://bar.biocomp.unibo.it/bar2.0.
Resumo:
According to the latest statistics projections formulated by Eurostat, the proportion of elderly EU-27’s population aged over 65 years old is predicted to increase from 17.5 % in 2011 to 29.5 % by 2060. This "population explosion" makes extremely important to identify the different genetic and molecular mechanisms which underpin the morbidity and mortality along with new strategies able to counteract or slow down its progress. In this scenario fits the European Project MARK-AGE whose aim was to identify a robust set of biomarkers of human ageing able to discriminate between chronological and biological ageing and to derive a model for healthy ageing through the analysis of three populations from different European countries, supposed to be characterized by different ageing rate: 1. Subjects representing the “Normal” or “Physiological” aging. 2. Subjects representing the “successful” or “decelerate” aging 3. Subjects representing the “accelerated” aging. The aim of this work was to recruit and characterize volunteers, to perform an accurate analysis of the health status of elderly recruited subjects (60-79 years) verifying any possible dissimilarity in their aging trajectories, to identify a set of robust ageing biomarkers and investigate possible correlations between ageing biomarkers and health status of recruited volunteers. The model proposed by MARK-AGE Project regarding different ageing trajectories has been confirmed and several ageing biomarkers have been identified.
Resumo:
The present thesis focuses on the problem of robust output regulation for minimum phase nonlinear systems by means of identification techniques. Given a controlled plant and an exosystem (an autonomous system that generates eventual references or disturbances), the control goal is to design a proper regulator able to process the only measure available, i.e the error/output variable, in order to make it asymptotically vanishing. In this context, such a regulator can be designed following the well known “internal model principle” that states how it is possible to achieve the regulation objective by embedding a replica of the exosystem model in the controller structure. The main problem shows up when the exosystem model is affected by parametric or structural uncertainties, in this case, it is not possible to reproduce the exact behavior of the exogenous system in the regulator and then, it is not possible to achieve the control goal. In this work, the idea is to find a solution to the problem trying to develop a general framework in which coexist both a standard regulator and an estimator able to guarantee (when possible) the best estimate of all uncertainties present in the exosystem in order to give “robustness” to the overall control loop.
Resumo:
In the first chapter, I develop a panel no-cointegration test which extends Pesaran, Shin and Smith (2001)'s bounds test to the panel framework by considering the individual regressions in a Seemingly Unrelated Regression (SUR) system. This allows to take into account unobserved common factors that contemporaneously affect all the units of the panel and provides, at the same time, unit-specific test statistics. Moreover, the approach is particularly suited when the number of individuals of the panel is small relatively to the number of time series observations. I develop the algorithm to implement the test and I use Monte Carlo simulation to analyze the properties of the test. The small sample properties of the test are remarkable, compared to its single equation counterpart. I illustrate the use of the test through a test of Purchasing Power Parity in a panel of EU15 countries. In the second chapter of my PhD thesis, I verify the Expectation Hypothesis of the Term Structure in the repurchasing agreements (repo) market with a new testing approach. I consider an "inexact" formulation of the EHTS, which models a time-varying component in the risk premia and I treat the interest rates as a non-stationary cointegrated system. The effect of the heteroskedasticity is controlled by means of testing procedures (bootstrap and heteroskedasticity correction) which are robust to variance and covariance shifts over time. I fi#nd that the long-run implications of EHTS are verified. A rolling window analysis clarifies that the EHTS is only rejected in periods of turbulence of #financial markets. The third chapter introduces the Stata command "bootrank" which implements the bootstrap likelihood ratio rank test algorithm developed by Cavaliere et al. (2012). The command is illustrated through an empirical application on the term structure of interest rates in the US.