5 resultados para Classification Methods

em AMS Tesi di Dottorato - Alm@DL - Università di Bologna


Relevância:

60.00% 60.00%

Publicador:

Resumo:

The present study is part of the EU Integrated Project “GEHA – Genetics of Healthy Aging” (Franceschi C et al., Ann N Y Acad Sci. 1100: 21-45, 2007), whose aim is to identify genes involved in healthy aging and longevity, which allow individuals to survive to advanced age in good cognitive and physical function and in the absence of major age-related diseases. Aims The major aims of this thesis were the following: 1. to outline the recruitment procedure of 90+ Italian siblings performed by the recruiting units of the University of Bologna (UNIBO) and Rome (ISS). The procedures related to the following items necessary to perform the study were described and commented: identification of the eligible area for recruitment, demographic aspects related to the need of getting census lists of 90+siblings, mail and phone contact with 90+ subjects and their families, bioethics aspects of the whole procedure, standardization of the recruitment methodology and set-up of a detailed flow chart to be followed by the European recruitment centres (obtainment of the informed consent form, anonimization of data by using a special code, how to perform the interview, how to collect the blood, how to enter data in the GEHA Phenotypic Data Base hosted at Odense). 2. to provide an overview of the phenotypic characteristics of 90+ Italian siblings recruited by the recruiting units of the University of Bologna (UNIBO) and Rome (ISS). The following items were addressed: socio-demographic characteristics, health status, cognitive assessment, physical conditions (handgrip strength test, chair-stand test, physical ability including ADL, vision and hearing ability, movement ability and doing light housework), life-style information (smoking and drinking habits) and subjective well-being (attitude towards life). Moreover, haematological parameters collected in the 90+ sibpairs as optional parameters by the Bologna and Rome recruiting units were used for a more comprehensive evaluation of the results obtained using the above mentioned phenotypic characteristics reported in the GEHA questionnaire. 3. to assess 90+ Italian siblings as far as their health/functional status is concerned on the basis of three classification methods proposed in previous studies on centenarians, which are based on: • actual functional capabilities (ADL, SMMSE, visual and hearing abilities) (Gondo et al., J Gerontol. 61A (3): 305-310, 2006); • actual functional capabilities and morbidity (ADL, ability to walk, SMMSE, presence of cancer, ictus, renal failure, anaemia, and liver diseases) (Franceschi et al., Aging Clin Exp Res, 12:77-84, 2000); • retrospectively collected data about past history of morbidity and age of disease onset (hypertension, heart disease, diabetes, stroke, cancer, osteopororis, neurological diseases, chronic obstructive pulmonary disease and ocular diseases) (Evert et al., J Gerontol A Biol Sci Med Sci. 58A (3): 232-237, 2003). Firstly these available models to define the health status of long-living subjects were applied to the sample and, since the classifications by Gondo and Franceschi are both based on the present functional status, they were compared in order to better recognize the healthy aging phenotype and to identify the best group of 90+ subjects out of the entire studied population. 4. to investigate the concordance of health and functional status among 90+ siblings in order to divide sibpairs in three categories: the best (both sibs are in good shape), the worst (both sibs are in bad shape) and an intermediate group (one sib is in good shape and the other is in bad shape). Moreover, the evaluation wanted to discover which variables are concordant among siblings; thus, concordant variables could be considered as familiar variables (determined by the environment or by genetics). 5. to perform a survival analysis by using mortality data at 1st January 2009 from the follow-up as the main outcome and selected functional and clinical parameters as explanatory variables. Methods A total of 765 90+ Italian subjects recruited by UNIBO (549 90+ siblings, belonging to 258 families) and ISS (216 90+ siblings, belonging to 106 families) recruiting units are included in the analysis. Each subject was interviewed according to a standardized questionnaire, comprising extensively utilized questions that have been validated in previous European studies on elderly subjects and covering demographic information, life style, living conditions, cognitive status (SMMSE), mood, health status and anthropometric measurements. Moreover, subjects were asked to perform some physical tests (Hand Grip Strength test and Chair Standing test) and a sample of about 24 mL of blood was collected and then processed according to a common protocol for the preparation and storage of DNA aliquots. Results From the analysis the main findings are the following: - a standardized protocol to assess cognitive status, physical performances and health status of European nonagenarian subjects was set up, in respect to ethical requirements, and it is available as a reference for other studies in this field; - GEHA families are enriched in long-living members and extreme survival, and represent an appropriate model for the identification of genes involved in healthy aging and longevity; - two simplified sets of criteria to classify 90+ sibling according to their health status were proposed, as operational tools for distinguishing healthy from non healthy subjects; - cognitive and functional parameters have a major role in categorizing 90+ siblings for the health status; - parameters such as education and good physical abilities (500 metres walking ability, going up and down the stairs ability, high scores at hand grip and chair stand tests) are associated with a good health status (defined as “cognitive unimpairment and absence of disability”); - male nonagenarians show a more homogeneous phenotype than females, and, though far fewer in number, tend to be healthier than females; - in males the good health status is not protective for survival, confirming the male-female health survival paradox; - survival after age 90 was dependent mainly on intact cognitive status and absence of functional disabilities; - haemoglobin and creatinine levels are both associated with longevity; - the most concordant items among 90+ siblings are related to the functional status, indicating that they contain a familiar component. It is still to be investigated at what level this familiar component is determined by genetics or by environment or by the interaction between genetics, environment and chance (and at what level). Conclusions In conclusion, we could state that this study, in accordance with the main objectives of the whole GEHA project, represents one of the first attempt to identify the biological and non biological determinants of successful/unsuccessful aging and longevity. Here, the analysis was performed on 90+ siblings recruited in Northern and Central Italy and it could be used as a reference for others studies in this field on Italian population. Moreover, it contributed to the definition of “successful” and “unsuccessful” aging and categorising a very large cohort of our most elderly subjects into “successful” and “unsuccessful” groups provided an unrivalled opportunity to detect some of the basic genetic/molecular mechanisms which underpin good health as opposed to chronic disability. Discoveries in the topic of the biological determinants of healthy aging represent a real possibility to identify new markers to be utilized for the identification of subgroups of old European citizens having a higher risk to develop age-related diseases and disabilities and to direct major preventive medicine strategies for the new epidemic of chronic disease in the 21st century.

Relevância:

40.00% 40.00%

Publicador:

Resumo:

Information is nowadays a key resource: machine learning and data mining techniques have been developed to extract high-level information from great amounts of data. As most data comes in form of unstructured text in natural languages, research on text mining is currently very active and dealing with practical problems. Among these, text categorization deals with the automatic organization of large quantities of documents in priorly defined taxonomies of topic categories, possibly arranged in large hierarchies. In commonly proposed machine learning approaches, classifiers are automatically trained from pre-labeled documents: they can perform very accurate classification, but often require a consistent training set and notable computational effort. Methods for cross-domain text categorization have been proposed, allowing to leverage a set of labeled documents of one domain to classify those of another one. Most methods use advanced statistical techniques, usually involving tuning of parameters. A first contribution presented here is a method based on nearest centroid classification, where profiles of categories are generated from the known domain and then iteratively adapted to the unknown one. Despite being conceptually simple and having easily tuned parameters, this method achieves state-of-the-art accuracy in most benchmark datasets with fast running times. A second, deeper contribution involves the design of a domain-independent model to distinguish the degree and type of relatedness between arbitrary documents and topics, inferred from the different types of semantic relationships between respective representative words, identified by specific search algorithms. The application of this model is tested on both flat and hierarchical text categorization, where it potentially allows the efficient addition of new categories during classification. Results show that classification accuracy still requires improvements, but models generated from one domain are shown to be effectively able to be reused in a different one.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Statistical modelling and statistical learning theory are two powerful analytical frameworks for analyzing signals and developing efficient processing and classification algorithms. In this thesis, these frameworks are applied for modelling and processing biomedical signals in two different contexts: ultrasound medical imaging systems and primate neural activity analysis and modelling. In the context of ultrasound medical imaging, two main applications are explored: deconvolution of signals measured from a ultrasonic transducer and automatic image segmentation and classification of prostate ultrasound scans. In the former application a stochastic model of the radio frequency signal measured from a ultrasonic transducer is derived. This model is then employed for developing in a statistical framework a regularized deconvolution procedure, for enhancing signal resolution. In the latter application, different statistical models are used to characterize images of prostate tissues, extracting different features. These features are then uses to segment the images in region of interests by means of an automatic procedure based on a statistical model of the extracted features. Finally, machine learning techniques are used for automatic classification of the different region of interests. In the context of neural activity signals, an example of bio-inspired dynamical network was developed to help in studies of motor-related processes in the brain of primate monkeys. The presented model aims to mimic the abstract functionality of a cell population in 7a parietal region of primate monkeys, during the execution of learned behavioural tasks.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Machine learning comprises a series of techniques for automatic extraction of meaningful information from large collections of noisy data. In many real world applications, data is naturally represented in structured form. Since traditional methods in machine learning deal with vectorial information, they require an a priori form of preprocessing. Among all the learning techniques for dealing with structured data, kernel methods are recognized to have a strong theoretical background and to be effective approaches. They do not require an explicit vectorial representation of the data in terms of features, but rely on a measure of similarity between any pair of objects of a domain, the kernel function. Designing fast and good kernel functions is a challenging problem. In the case of tree structured data two issues become relevant: kernel for trees should not be sparse and should be fast to compute. The sparsity problem arises when, given a dataset and a kernel function, most structures of the dataset are completely dissimilar to one another. In those cases the classifier has too few information for making correct predictions on unseen data. In fact, it tends to produce a discriminating function behaving as the nearest neighbour rule. Sparsity is likely to arise for some standard tree kernel functions, such as the subtree and subset tree kernel, when they are applied to datasets with node labels belonging to a large domain. A second drawback of using tree kernels is the time complexity required both in learning and classification phases. Such a complexity can sometimes prevents the kernel application in scenarios involving large amount of data. This thesis proposes three contributions for resolving the above issues of kernel for trees. A first contribution aims at creating kernel functions which adapt to the statistical properties of the dataset, thus reducing its sparsity with respect to traditional tree kernel functions. Specifically, we propose to encode the input trees by an algorithm able to project the data onto a lower dimensional space with the property that similar structures are mapped similarly. By building kernel functions on the lower dimensional representation, we are able to perform inexact matchings between different inputs in the original space. A second contribution is the proposal of a novel kernel function based on the convolution kernel framework. Convolution kernel measures the similarity of two objects in terms of the similarities of their subparts. Most convolution kernels are based on counting the number of shared substructures, partially discarding information about their position in the original structure. The kernel function we propose is, instead, especially focused on this aspect. A third contribution is devoted at reducing the computational burden related to the calculation of a kernel function between a tree and a forest of trees, which is a typical operation in the classification phase and, for some algorithms, also in the learning phase. We propose a general methodology applicable to convolution kernels. Moreover, we show an instantiation of our technique when kernels such as the subtree and subset tree kernels are employed. In those cases, Direct Acyclic Graphs can be used to compactly represent shared substructures in different trees, thus reducing the computational burden and storage requirements.

Relevância:

30.00% 30.00%

Publicador:

Resumo:

Different types of proteins exist with diverse functions that are essential for living organisms. An important class of proteins is represented by transmembrane proteins which are specifically designed to be inserted into biological membranes and devised to perform very important functions in the cell such as cell communication and active transport across the membrane. Transmembrane β-barrels (TMBBs) are a sub-class of membrane proteins largely under-represented in structure databases because of the extreme difficulty in experimental structure determination. For this reason, computational tools that are able to predict the structure of TMBBs are needed. In this thesis, two computational problems related to TMBBs were addressed: the detection of TMBBs in large datasets of proteins and the prediction of the topology of TMBB proteins. Firstly, a method for TMBB detection was presented based on a novel neural network framework for variable-length sequence classification. The proposed approach was validated on a non-redundant dataset of proteins. Furthermore, we carried-out genome-wide detection using the entire Escherichia coli proteome. In both experiments, the method significantly outperformed other existing state-of-the-art approaches, reaching very high PPV (92%) and MCC (0.82). Secondly, a method was also introduced for TMBB topology prediction. The proposed approach is based on grammatical modelling and probabilistic discriminative models for sequence data labeling. The method was evaluated using a newly generated dataset of 38 TMBB proteins obtained from high-resolution data in the PDB. Results have shown that the model is able to correctly predict topologies of 25 out of 38 protein chains in the dataset. When tested on previously released datasets, the performances of the proposed approach were measured as comparable or superior to the current state-of-the-art of TMBB topology prediction.