5 resultados para Marchuk, William M.: A life science living lexicon. CD-ROM for Macintosh or Windows
em AMS Tesi di Dottorato - Alm@DL - Università di Bologna
Regularization meets GreenAI: a new framework for image reconstruction in life sciences applications
Resumo:
Ill-conditioned inverse problems frequently arise in life sciences, particularly in the context of image deblurring and medical image reconstruction. These problems have been addressed through iterative variational algorithms, which regularize the reconstruction by adding prior knowledge about the problem's solution. Despite the theoretical reliability of these methods, their practical utility is constrained by the time required to converge. Recently, the advent of neural networks allowed the development of reconstruction algorithms that can compute highly accurate solutions with minimal time demands. Regrettably, it is well-known that neural networks are sensitive to unexpected noise, and the quality of their reconstructions quickly deteriorates when the input is slightly perturbed. Modern efforts to address this challenge have led to the creation of massive neural network architectures, but this approach is unsustainable from both ecological and economic standpoints. The recently introduced GreenAI paradigm argues that developing sustainable neural network models is essential for practical applications. In this thesis, we aim to bridge the gap between theory and practice by introducing a novel framework that combines the reliability of model-based iterative algorithms with the speed and accuracy of end-to-end neural networks. Additionally, we demonstrate that our framework yields results comparable to state-of-the-art methods while using relatively small, sustainable models. In the first part of this thesis, we discuss the proposed framework from a theoretical perspective. We provide an extension of classical regularization theory, applicable in scenarios where neural networks are employed to solve inverse problems, and we show there exists a trade-off between accuracy and stability. Furthermore, we demonstrate the effectiveness of our methods in common life science-related scenarios. In the second part of the thesis, we initiate an exploration extending the proposed method into the probabilistic domain. We analyze some properties of deep generative models, revealing their potential applicability in addressing ill-posed inverse problems.
Resumo:
In the past decade, the advent of efficient genome sequencing tools and high-throughput experimental biotechnology has lead to enormous progress in the life science. Among the most important innovations is the microarray tecnology. It allows to quantify the expression for thousands of genes simultaneously by measurin the hybridization from a tissue of interest to probes on a small glass or plastic slide. The characteristics of these data include a fair amount of random noise, a predictor dimension in the thousand, and a sample noise in the dozens. One of the most exciting areas to which microarray technology has been applied is the challenge of deciphering complex disease such as cancer. In these studies, samples are taken from two or more groups of individuals with heterogeneous phenotypes, pathologies, or clinical outcomes. these samples are hybridized to microarrays in an effort to find a small number of genes which are strongly correlated with the group of individuals. Eventhough today methods to analyse the data are welle developed and close to reach a standard organization (through the effort of preposed International project like Microarray Gene Expression Data -MGED- Society [1]) it is not unfrequant to stumble in a clinician's question that do not have a compelling statistical method that could permit to answer it.The contribution of this dissertation in deciphering disease regards the development of new approaches aiming at handle open problems posed by clinicians in handle specific experimental designs. In Chapter 1 starting from a biological necessary introduction, we revise the microarray tecnologies and all the important steps that involve an experiment from the production of the array, to the quality controls ending with preprocessing steps that will be used into the data analysis in the rest of the dissertation. While in Chapter 2 a critical review of standard analysis methods are provided stressing most of problems that In Chapter 3 is introduced a method to adress the issue of unbalanced design of miacroarray experiments. In microarray experiments, experimental design is a crucial starting-point for obtaining reasonable results. In a two-class problem, an equal or similar number of samples it should be collected between the two classes. However in some cases, e.g. rare pathologies, the approach to be taken is less evident. We propose to address this issue by applying a modified version of SAM [2]. MultiSAM consists in a reiterated application of a SAM analysis, comparing the less populated class (LPC) with 1,000 random samplings of the same size from the more populated class (MPC) A list of the differentially expressed genes is generated for each SAM application. After 1,000 reiterations, each single probe given a "score" ranging from 0 to 1,000 based on its recurrence in the 1,000 lists as differentially expressed. The performance of MultiSAM was compared to the performance of SAM and LIMMA [3] over two simulated data sets via beta and exponential distribution. The results of all three algorithms over low- noise data sets seems acceptable However, on a real unbalanced two-channel data set reagardin Chronic Lymphocitic Leukemia, LIMMA finds no significant probe, SAM finds 23 significantly changed probes but cannot separate the two classes, while MultiSAM finds 122 probes with score >300 and separates the data into two clusters by hierarchical clustering. We also report extra-assay validation in terms of differentially expressed genes Although standard algorithms perform well over low-noise simulated data sets, multi-SAM seems to be the only one able to reveal subtle differences in gene expression profiles on real unbalanced data. In Chapter 4 a method to adress similarities evaluation in a three-class prblem by means of Relevance Vector Machine [4] is described. In fact, looking at microarray data in a prognostic and diagnostic clinical framework, not only differences could have a crucial role. In some cases similarities can give useful and, sometimes even more, important information. The goal, given three classes, could be to establish, with a certain level of confidence, if the third one is similar to the first or the second one. In this work we show that Relevance Vector Machine (RVM) [2] could be a possible solutions to the limitation of standard supervised classification. In fact, RVM offers many advantages compared, for example, with his well-known precursor (Support Vector Machine - SVM [3]). Among these advantages, the estimate of posterior probability of class membership represents a key feature to address the similarity issue. This is a highly important, but often overlooked, option of any practical pattern recognition system. We focused on Tumor-Grade-three-class problem, so we have 67 samples of grade I (G1), 54 samples of grade 3 (G3) and 100 samples of grade 2 (G2). The goal is to find a model able to separate G1 from G3, then evaluate the third class G2 as test-set to obtain the probability for samples of G2 to be member of class G1 or class G3. The analysis showed that breast cancer samples of grade II have a molecular profile more similar to breast cancer samples of grade I. Looking at the literature this result have been guessed, but no measure of significance was gived before.
Resumo:
The study of protein fold is a central problem in life science, leading in the last years to several attempts for improving our knowledge of the protein structures. In this thesis this challenging problem is tackled by means of molecular dynamics, chirality and NMR studies. In the last decades, many algorithms were designed for the protein secondary structure assignment, which reveals the local protein shape adopted by segments of amino acids. In this regard, the use of local chirality for the protein secondary structure assignment was demonstreted, trying to correlate as well the propensity of a given amino acid for a particular secondary structure. The protein fold can be studied also by Nuclear Magnetic Resonance (NMR) investigations, finding the average structure adopted from a protein. In this context, the effect of Residual Dipolar Couplings (RDCs) in the structure refinement was shown, revealing a strong improvement of structure resolution. A wide extent of this thesis is devoted to the study of avian prion protein. Prion protein is the main responsible of a vast class of neurodegenerative diseases, known as Bovine Spongiform Encephalopathy (BSE), present in mammals, but not in avian species and it is caused from the conversion of cellular prion protein to the pathogenic misfolded isoform, accumulating in the brain in form of amiloyd plaques. In particular, the N-terminal region, namely the initial part of the protein, is quite different between mammal and avian species but both of them contain multimeric sequences called Repeats, octameric in mammals and hexameric in avians. However, such repeat regions show differences in the contained amino acids, in particular only avian hexarepeats contain tyrosine residues. The chirality analysis of avian prion protein configurations obtained from molecular dynamics reveals a high stiffness of the avian protein, which tends to preserve its regular secondary structure. This is due to the presence of prolines, histidines and especially tyrosines, which form a hydrogen bond network in the hexarepeat region, only possible in the avian protein, and thus probably hampering the aggregation.
Resumo:
The present study is part of the EU Integrated Project “GEHA – Genetics of Healthy Aging” (Franceschi C et al., Ann N Y Acad Sci. 1100: 21-45, 2007), whose aim is to identify genes involved in healthy aging and longevity, which allow individuals to survive to advanced age in good cognitive and physical function and in the absence of major age-related diseases. Aims The major aims of this thesis were the following: 1. to outline the recruitment procedure of 90+ Italian siblings performed by the recruiting units of the University of Bologna (UNIBO) and Rome (ISS). The procedures related to the following items necessary to perform the study were described and commented: identification of the eligible area for recruitment, demographic aspects related to the need of getting census lists of 90+siblings, mail and phone contact with 90+ subjects and their families, bioethics aspects of the whole procedure, standardization of the recruitment methodology and set-up of a detailed flow chart to be followed by the European recruitment centres (obtainment of the informed consent form, anonimization of data by using a special code, how to perform the interview, how to collect the blood, how to enter data in the GEHA Phenotypic Data Base hosted at Odense). 2. to provide an overview of the phenotypic characteristics of 90+ Italian siblings recruited by the recruiting units of the University of Bologna (UNIBO) and Rome (ISS). The following items were addressed: socio-demographic characteristics, health status, cognitive assessment, physical conditions (handgrip strength test, chair-stand test, physical ability including ADL, vision and hearing ability, movement ability and doing light housework), life-style information (smoking and drinking habits) and subjective well-being (attitude towards life). Moreover, haematological parameters collected in the 90+ sibpairs as optional parameters by the Bologna and Rome recruiting units were used for a more comprehensive evaluation of the results obtained using the above mentioned phenotypic characteristics reported in the GEHA questionnaire. 3. to assess 90+ Italian siblings as far as their health/functional status is concerned on the basis of three classification methods proposed in previous studies on centenarians, which are based on: • actual functional capabilities (ADL, SMMSE, visual and hearing abilities) (Gondo et al., J Gerontol. 61A (3): 305-310, 2006); • actual functional capabilities and morbidity (ADL, ability to walk, SMMSE, presence of cancer, ictus, renal failure, anaemia, and liver diseases) (Franceschi et al., Aging Clin Exp Res, 12:77-84, 2000); • retrospectively collected data about past history of morbidity and age of disease onset (hypertension, heart disease, diabetes, stroke, cancer, osteopororis, neurological diseases, chronic obstructive pulmonary disease and ocular diseases) (Evert et al., J Gerontol A Biol Sci Med Sci. 58A (3): 232-237, 2003). Firstly these available models to define the health status of long-living subjects were applied to the sample and, since the classifications by Gondo and Franceschi are both based on the present functional status, they were compared in order to better recognize the healthy aging phenotype and to identify the best group of 90+ subjects out of the entire studied population. 4. to investigate the concordance of health and functional status among 90+ siblings in order to divide sibpairs in three categories: the best (both sibs are in good shape), the worst (both sibs are in bad shape) and an intermediate group (one sib is in good shape and the other is in bad shape). Moreover, the evaluation wanted to discover which variables are concordant among siblings; thus, concordant variables could be considered as familiar variables (determined by the environment or by genetics). 5. to perform a survival analysis by using mortality data at 1st January 2009 from the follow-up as the main outcome and selected functional and clinical parameters as explanatory variables. Methods A total of 765 90+ Italian subjects recruited by UNIBO (549 90+ siblings, belonging to 258 families) and ISS (216 90+ siblings, belonging to 106 families) recruiting units are included in the analysis. Each subject was interviewed according to a standardized questionnaire, comprising extensively utilized questions that have been validated in previous European studies on elderly subjects and covering demographic information, life style, living conditions, cognitive status (SMMSE), mood, health status and anthropometric measurements. Moreover, subjects were asked to perform some physical tests (Hand Grip Strength test and Chair Standing test) and a sample of about 24 mL of blood was collected and then processed according to a common protocol for the preparation and storage of DNA aliquots. Results From the analysis the main findings are the following: - a standardized protocol to assess cognitive status, physical performances and health status of European nonagenarian subjects was set up, in respect to ethical requirements, and it is available as a reference for other studies in this field; - GEHA families are enriched in long-living members and extreme survival, and represent an appropriate model for the identification of genes involved in healthy aging and longevity; - two simplified sets of criteria to classify 90+ sibling according to their health status were proposed, as operational tools for distinguishing healthy from non healthy subjects; - cognitive and functional parameters have a major role in categorizing 90+ siblings for the health status; - parameters such as education and good physical abilities (500 metres walking ability, going up and down the stairs ability, high scores at hand grip and chair stand tests) are associated with a good health status (defined as “cognitive unimpairment and absence of disability”); - male nonagenarians show a more homogeneous phenotype than females, and, though far fewer in number, tend to be healthier than females; - in males the good health status is not protective for survival, confirming the male-female health survival paradox; - survival after age 90 was dependent mainly on intact cognitive status and absence of functional disabilities; - haemoglobin and creatinine levels are both associated with longevity; - the most concordant items among 90+ siblings are related to the functional status, indicating that they contain a familiar component. It is still to be investigated at what level this familiar component is determined by genetics or by environment or by the interaction between genetics, environment and chance (and at what level). Conclusions In conclusion, we could state that this study, in accordance with the main objectives of the whole GEHA project, represents one of the first attempt to identify the biological and non biological determinants of successful/unsuccessful aging and longevity. Here, the analysis was performed on 90+ siblings recruited in Northern and Central Italy and it could be used as a reference for others studies in this field on Italian population. Moreover, it contributed to the definition of “successful” and “unsuccessful” aging and categorising a very large cohort of our most elderly subjects into “successful” and “unsuccessful” groups provided an unrivalled opportunity to detect some of the basic genetic/molecular mechanisms which underpin good health as opposed to chronic disability. Discoveries in the topic of the biological determinants of healthy aging represent a real possibility to identify new markers to be utilized for the identification of subgroups of old European citizens having a higher risk to develop age-related diseases and disabilities and to direct major preventive medicine strategies for the new epidemic of chronic disease in the 21st century.
Resumo:
Although the debate of what data science is has a long history and has not reached a complete consensus yet, Data Science can be summarized as the process of learning from data. Guided by the above vision, this thesis presents two independent data science projects developed in the scope of multidisciplinary applied research. The first part analyzes fluorescence microscopy images typically produced in life science experiments, where the objective is to count how many marked neuronal cells are present in each image. Aiming to automate the task for supporting research in the area, we propose a neural network architecture tuned specifically for this use case, cell ResUnet (c-ResUnet), and discuss the impact of alternative training strategies in overcoming particular challenges of our data. The approach provides good results in terms of both detection and counting, showing performance comparable to the interpretation of human operators. As a meaningful addition, we release the pre-trained model and the Fluorescent Neuronal Cells dataset collecting pixel-level annotations of where neuronal cells are located. In this way, we hope to help future research in the area and foster innovative methodologies for tackling similar problems. The second part deals with the problem of distributed data management in the context of LHC experiments, with a focus on supporting ATLAS operations concerning data transfer failures. In particular, we analyze error messages produced by failed transfers and propose a Machine Learning pipeline that leverages the word2vec language model and K-means clustering. This provides groups of similar errors that are presented to human operators as suggestions of potential issues to investigate. The approach is demonstrated on one full day of data, showing promising ability in understanding the message content and providing meaningful groupings, in line with previously reported incidents by human operators.