20 results for Statistical Prediction
Abstract:
The principal topic of this work is the application of data mining techniques, in particular machine learning, to the discovery of knowledge in a protein database. The first chapter presents general background: section 1.1 gives an overview of the methodology of a data mining project and its main algorithms, and section 1.2 introduces proteins and their supporting file formats. The chapter concludes with section 1.3, which defines the main problem this work intends to address: determining whether an amino acid is exposed or buried in a protein, in a discrete (i.e. non-continuous) way, for five exposure levels: 2%, 10%, 20%, 25% and 30%. The second chapter, closely following the CRISP-DM methodology, presents the whole process of constructing the database that supported this work. It describes the process of loading data from the Protein Data Bank, DSSP and SCOP. An initial data exploration is then performed and a simple baseline prediction model of the relative solvent accessibility of an amino acid is introduced. The chapter also introduces the Data Mining Table Creator, a program developed to produce the data mining tables required for this problem. In the third chapter the results obtained are analyzed with statistical significance tests. First, the classifiers used (Neural Networks, C5.0, CART and CHAID) are compared, and it is concluded that C5.0 is the most suitable for the problem at hand. The influence of parameters such as the amino acid information level, the amino acid window size and the SCOP class type on the accuracy of the predictive models is also compared. The fourth chapter starts with a brief review of the literature on amino acid relative solvent accessibility. It then gives an overview of the main results achieved and finally discusses possible future work. The fifth and last chapter consists of appendices. Appendix A has the schema of the database that supported this thesis. Appendix B has a set of tables with additional information. Appendix C describes the software provided on the DVD accompanying this thesis, which allows the present work to be reconstructed.
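The discrete exposed/buried labelling described in this abstract can be sketched as follows. This is only an illustration, not the thesis's implementation: it assumes the relative solvent accessibility (RSA) of a residue is already available as a percentage, and that a residue counts as exposed when its RSA is at or above the threshold. The function name and the handling of values exactly on a threshold are assumptions; only the five thresholds come from the abstract.

```python
# Thresholds (in percent) taken from the abstract.
EXPOSURE_THRESHOLDS = (2, 10, 20, 25, 30)


def label_exposure(rsa_percent, thresholds=EXPOSURE_THRESHOLDS):
    """Return a binary exposed/buried label for each threshold.

    Assumption: RSA >= threshold means "exposed"; the abstract does not
    say how a value exactly on the threshold is treated.
    """
    return {
        t: ("exposed" if rsa_percent >= t else "buried")
        for t in thresholds
    }


# Hypothetical residue with an RSA of 22.5%.
labels = label_exposure(22.5)
for threshold, label in labels.items():
    print(f"threshold {threshold:>2}%: {label}")
```

Each threshold gives a separate two-class problem, so a classifier can be trained and evaluated once per exposure level.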
Abstract:
This thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces (PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds. NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs), taking as input the difference between the 1H NMR spectra of the products and the reactants. Such a representation can be applied to the automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, the evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave similar results.
The use of supervised learning techniques improved the results to 77% of correct assignments when an ensemble of ten FFNNs was used and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data were combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions. Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications, ranging from the computer validation of classification systems and the genome-scale reconstruction (or comparison) of metabolic pathways to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by EC numbers, which are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and to automatically assign EC numbers to reactions that are not yet officially classified.
In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by MOLMAP descriptors and submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass and full EC number level, respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs: EC numbers were assigned at the class, subclass and sub-subclass level with accuracies of 78%, 74% and 63%, respectively. In the course of the experiments with metabolic reactions we suggested that the MOLMAP/SOM concept could be extended to the representation of other levels of metabolic information such as metabolic pathways.
Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway, that is, a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) made it possible to map and perceive chemical similarities between metabolic pathways, even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers. Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function. The results indicated that for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs. The NN models showed a remarkable ability to interpolate between distant curves and accurately reproduce potentials to be used in molecular simulations. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task.
The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to cover well the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform equally well or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au (111) interface which is the case studied in the present Thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations.
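For orientation, the Lennard-Jones reference potential used in the first series of PES experiments has the standard form V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below generates (distance, energy) pairs of the kind a network could be trained on. It is a minimal illustration only: the reduced units (ε = σ = 1), the sampling range and the function names are assumptions and do not come from the thesis.

```python
def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Standard 12-6 Lennard-Jones pair potential at distance r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def sample_curve(r_min=0.9, r_max=3.0, n_points=50, epsilon=1.0, sigma=1.0):
    """Return evenly spaced (r, V(r)) pairs, e.g. as NN training data.

    The range and number of points are hypothetical choices.
    """
    step = (r_max - r_min) / (n_points - 1)
    distances = [r_min + i * step for i in range(n_points)]
    return [(r, lennard_jones(r, epsilon, sigma)) for r in distances]


# The potential minimum lies at r = 2^(1/6) * sigma, where V = -epsilon.
r_minimum = 2 ** (1 / 6)
training_pairs = sample_curve()
print(f"V at the minimum: {lennard_jones(r_minimum):.6f}")
print(f"number of training pairs: {len(training_pairs)}")
```

Curves for different ε and σ values can be produced the same way, which is the kind of family of curves a trained model would need to interpolate between.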
Abstract:
Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa to obtain the degree of Master in Environmental Engineering
Abstract:
Dissertation presented to obtain a Masters degree in Computer Science
Abstract:
Nonlinear Dynamics, Vol. 29
Abstract:
Proceedings of the European Control Conference, ECC’01, Porto, Portugal, September 2001
Abstract:
Background: Accurate predictors of body composition are valuable tools for guiding nutritional strategies in newborns and young infants needing intensive care. The upper arm is an easily accessible part of the body that is convenient for assessing regional body composition through the measurement of its compartments. Anthropometry and ultrasonography (US) are noninvasive and relatively inexpensive methods for bedside assessment of the upper-arm compartments. However, these methods have not yet been validated in this age subgroup. Magnetic resonance imaging (MRI) may be used as the gold standard to validate the measurements of the upper-arm compartments. Objective: To validate the upper-arm measurements by anthropometry and by US in preterm infants. Methods: A cohort of neonates consecutively admitted to the neonatal intensive care unit, with 33 weeks of gestational age and appropriate weight for gestational age, without major congenital abnormalities and not subjected to diuretics or oxygen therapy at the time of assessment, was studied. Shortly before discharge, the upper arm was measured blindly by anthropometry, US and MRI. The direct anthropometric parameters measured were: weight (W), length (L), head circumference (HC), mid-arm circumference (MAC), and tricipital skinfold thickness. The arm area (AA), arm muscle area (AMA) and arm fat area were calculated by the methods proposed by Jelliffe & Jelliffe and by Rolland-Cachera. Using the Toshiba SSH 140A Sonolayer and the PSH-7DLT 7Hz probe, the arm and muscle perimeters were measured by US, the arm and muscle areas were calculated automatically, and the fat area was obtained by subtraction. The MR images were acquired on a 1.5-T Philips Gyroscan ACS-NT, Power-Track 1000 scanner, with a quadrature knee coil used for the upper-arm measurements. For statistical analysis, parametric and nonparametric methods were used as appropriate. Results: Thirty infants born at (mean ±SD) 30.7 ±1.9 weeks of gestational age and weighing 1380 ±325g were included in the study; they were assessed at 35.4 ±1.1 weeks of corrected age, when weighing 1786 ±93g. None of the anthropometric measurements is individually an acceptable predictor (r² <0.5) of the measurements obtained by MRI. The best and simplest alternative equation found is the one that predicts the AMA (r² = 0.56), derived from the results of multiple regression analysis: AMA by MRI = (W x 0.17) + (MAC x 5.2) – (L x 6) – 150, with W expressed in g, and L and MAC in cm. None of the ultrasonographic measurements is an acceptable predictor (r² <0.5) of the measurements obtained by MRI. Conclusions: Measurements of the upper arm by anthropometry and by US are not accurate predictors of regional body composition in preterm infants who are appropriate for gestational age.
Abstract:
Background: Little is known about the risk of progression to hazardous alcohol use in people currently drinking at safe limits. We aimed to develop a prediction model (predictAL) for the development of hazardous drinking in safe drinkers. Methods: A prospective cohort study of adult general practice attendees in six European countries and Chile, followed up over 6 months. We recruited 10,045 attendees between April 2003 and February 2005. At recruitment, 6193 European and 2462 Chilean attendees recorded AUDIT scores below 8 in men and 5 in women and were used in modelling risk. Thirty-eight risk factors were measured to construct a risk model for the development of hazardous drinking using stepwise logistic regression. The model was corrected for overfitting and tested in an external population. The main outcome was hazardous drinking, defined by an AUDIT score >= 8 in men and >= 5 in women. Results: 69.0% of attendees were recruited, of whom 89.5% participated again after six months. The risk factors in the final predictAL model were sex, age, country, baseline AUDIT score, panic syndrome and lifetime alcohol problem. The predictAL model's average c-index across all six European countries was 0.839 (95% CI 0.805, 0.873). The Hedges' g effect size for the difference in log odds of predicted probability between safe drinkers in Europe who subsequently developed hazardous alcohol use and those who did not was 1.38 (95% CI 1.25, 1.51). External validation of the algorithm in Chilean safe drinkers resulted in a c-index of 0.781 (95% CI 0.717, 0.846) and Hedges' g of 0.68 (95% CI 0.57, 0.78). Conclusions: The predictAL risk model for development of hazardous consumption in safe drinkers compares favourably with risk algorithms for disorders in other medical settings and can be a useful first step in the prevention of alcohol misuse.
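The outcome definition and the c-index reported in this abstract can be illustrated with a short sketch. This is not the predictAL model itself: the AUDIT cut-offs come from the abstract, while the function names, the `"male"`/`"female"` encoding and the example risks are hypothetical. For a binary outcome, the c-index is the proportion of (case, non-case) pairs in which the case received the higher predicted risk, with ties counted as one half.

```python
from itertools import product


def is_hazardous(audit_score, sex):
    """Hazardous drinking per the abstract: AUDIT >= 8 (men), >= 5 (women)."""
    threshold = 8 if sex == "male" else 5
    return audit_score >= threshold


def c_index(predicted_risks, outcomes):
    """Concordance index for a binary outcome (equivalent to the ROC AUC)."""
    cases = [p for p, y in zip(predicted_risks, outcomes) if y]
    non_cases = [p for p, y in zip(predicted_risks, outcomes) if not y]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")
    concordant = sum(
        1.0 if case > non_case else 0.5 if case == non_case else 0.0
        for case, non_case in product(cases, non_cases)
    )
    return concordant / (len(cases) * len(non_cases))


# Hypothetical predicted risks and follow-up outcomes for four attendees.
risks = [0.9, 0.8, 0.3, 0.2]
developed_hazardous_use = [True, False, True, False]
print(c_index(risks, developed_hazardous_use))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of those who did and did not develop hazardous drinking.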
Abstract:
Dissertation presented as a partial requirement for obtaining the degree of Master in Statistics and Information Management
Abstract:
Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies.
Abstract:
A Work Project, presented as part of the requirements for the Award of a Masters Degree in Finance from the NOVA – School of Business and Economics
Abstract:
A Work Project, presented as part of the requirements for the Award of a Masters Degree in Economics from the NOVA – School of Business and Economics
Abstract:
A Work Project, presented as part of the requirements for the Award of a Masters Degree in Management from the NOVA – School of Business and Economics
Abstract:
Dissertation for obtaining the degree of Master in Geological Engineering (Georesources)