939 resultados para STATISTICAL DATA
Resumo:
Observations of an insect's movement lead to theory on the insect's flight behaviour and the role of movement in the species' population dynamics. This theory leads to predictions of the way the population changes in time under different conditions. If a hypothesis on movement predicts a specific change in the population, then the hypothesis can be tested against observations of population change. Routine pest monitoring of agricultural crops provides a convenient source of data for studying movement into a region and among fields within a region. Examples of the use of statistical and computational methods for testing hypotheses with such data are presented. The types of questions that can be addressed with these methods and the limitations of pest monitoring data when used for this purpose are discussed. (C) 2002 Elsevier Science B.V. All rights reserved.
Resumo:
This paper proposes a template for modelling complex datasets that integrates traditional statistical modelling approaches with more recent advances in statistics and modelling through an exploratory framework. Our approach builds on the well-known and long standing traditional idea of 'good practice in statistics' by establishing a comprehensive framework for modelling that focuses on exploration, prediction, interpretation and reliability assessment, a relatively new idea that allows individual assessment of predictions. The integrated framework we present comprises two stages. The first involves the use of exploratory methods to help visually understand the data and identify a parsimonious set of explanatory variables. The second encompasses a two step modelling process, where the use of non-parametric methods such as decision trees and generalized additive models are promoted to identify important variables and their modelling relationship with the response before a final predictive model is considered. We focus on fitting the predictive model using parametric, non-parametric and Bayesian approaches. This paper is motivated by a medical problem where interest focuses on developing a risk stratification system for morbidity of 1,710 cardiac patients given a suite of demographic, clinical and preoperative variables. Although the methods we use are applied specifically to this case study, these methods can be applied across any field, irrespective of the type of response.
Resumo:
This article presents a research work, the goal of which was to achieve a model for the evaluation of data quality in institutional websites of health units in a broad and balanced way. We have carried out a literature review of the available approaches for the evaluation of website content quality, in order to identify the most recurrent dimensions and the attributes, and we have also carried out a Delphi method process with experts in order to reach an adequate set of attributes and their respective weights for the measurement of content quality. The results obtained revealed a high level of consensus among the experts who participated in the Delphi process. On the other hand, the different statistical analysis and techniques implemented are robust and attach confidence to our results and consequent model obtained.
Resumo:
Low noise surfaces have been increasingly considered as a viable and cost-effective alternative to acoustical barriers. However, road planners and administrators frequently lack information on the correlation between the type of road surface and the resulting noise emission profile. To address this problem, a method to identify and classify different types of road pavements was developed, whereby near field road noise is analyzed using statistical learning methods. The vehicle rolling sound signal near the tires and close to the road surface was acquired by two microphones in a special arrangement which implements the Close-Proximity method. A set of features, characterizing the properties of the road pavement, was extracted from the corresponding sound profiles. A feature selection method was used to automatically select those that are most relevant in predicting the type of pavement, while reducing the computational cost. A set of different types of road pavement segments were tested and the performance of the classifier was evaluated. Results of pavement classification performed during a road journey are presented on a map, together with geographical data. This procedure leads to a considerable improvement in the quality of road pavement noise data, thereby increasing the accuracy of road traffic noise prediction models.
Resumo:
This paper presents an investigation into cloud-to-ground lightning activity over the continental territory of Portugal with data collected by the national Lightning Location System. The Lightning Location System in Portugal is first presented. Analyses about geographical, seasonal, and polarity distribution of cloud-to-ground lightning activity and cumulative probability of peak current are carried out. An overall ground flash density map is constructed from the database, which contains the information of more than five years and almost four million records. This map is compared with the thunderstorm days map, produced by the Portuguese Institute of Meteorology, and with the orographic map of Portugal. Finally, conclusions are duly drawn.
Resumo:
OBJECTIVE: To estimate the spatial intensity of urban violence events using wavelet-based methods and emergency room data. METHODS: Information on victims attended at the emergency room of a public hospital in the city of São Paulo, Southeastern Brazil, from January 1, 2002 to January 11, 2003 were obtained from hospital records. The spatial distribution of 3,540 events was recorded and a uniform random procedure was used to allocate records with incomplete addresses. Point processes and wavelet analysis technique were used to estimate the spatial intensity, defined as the expected number of events by unit area. RESULTS: Of all georeferenced points, 59% were accidents and 40% were assaults. There is a non-homogeneous spatial distribution of the events with high concentration in two districts and three large avenues in the southern area of the city of São Paulo. CONCLUSIONS: Hospital records combined with methodological tools to estimate intensity of events are useful to study urban violence. The wavelet analysis is useful in the computation of the expected number of events and their respective confidence bands for any sub-region and, consequently, in the specification of risk estimates that could be used in decision-making processes for public policies.
Resumo:
Independent component analysis (ICA) has recently been proposed as a tool to unmix hyperspectral data. ICA is founded on two assumptions: 1) the observed spectrum vector is a linear mixture of the constituent spectra (endmember spectra) weighted by the correspondent abundance fractions (sources); 2)sources are statistically independent. Independent factor analysis (IFA) extends ICA to linear mixtures of independent sources immersed in noise. Concerning hyperspectral data, the first assumption is valid whenever the multiple scattering among the distinct constituent substances (endmembers) is negligible, and the surface is partitioned according to the fractional abundances. The second assumption, however, is violated, since the sum of abundance fractions associated to each pixel is constant due to physical constraints in the data acquisition process. Thus, sources cannot be statistically independent, this compromising the performance of ICA/IFA algorithms in hyperspectral unmixing. This paper studies the impact of hyperspectral source statistical dependence on ICA and IFA performances. We conclude that the accuracy of these methods tends to improve with the increase of the signature variability, of the number of endmembers, and of the signal-to-noise ratio. In any case, there are always endmembers incorrectly unmixed. We arrive to this conclusion by minimizing the mutual information of simulated and real hyperspectral mixtures. The computation of mutual information is based on fitting mixtures of Gaussians to the observed data. A method to sort ICA and IFA estimates in terms of the likelihood of being correctly unmixed is proposed.
Resumo:
The tongue is the most important and dynamic articulator for speech formation, because of its anatomic aspects (particularly, the large volume of this muscular organ comparatively to the surrounding organs of the vocal tract) and also due to the wide range of movements and flexibility that are involved. In speech communication research, a variety of techniques have been used for measuring the three-dimensional vocal tract shapes. More recently, magnetic resonance imaging (MRI) becomes common; mainly, because this technique allows the collection of a set of static and dynamic images that can represent the entire vocal tract along any orientation. Over the years, different anatomical organs of the vocal tract have been modelled; namely, 2D and 3D tongue models, using parametric or statistical modelling procedures. Our aims are to present and describe some 3D reconstructed models from MRI data, for one subject uttering sustained articulations of some typical Portuguese sounds. Thus, we present a 3D database of the tongue obtained by stack combinations with the subject articulating Portuguese vowels. This 3D knowledge of the speech organs could be very important; especially, for clinical purposes (for example, for the assessment of articulatory impairments followed by tongue surgery in speech rehabilitation), and also for a better understanding of acoustic theory in speech formation.
Resumo:
Modern real-time systems, with a more flexible and adaptive nature, demand approaches for timeliness evaluation based on probabilistic measures of meeting deadlines. In this context, simulation can emerge as an adequate solution to understand and analyze the timing behaviour of actual systems. However, care must be taken with the obtained outputs under the penalty of obtaining results with lack of credibility. Particularly important is to consider that we are more interested in values from the tail of a probability distribution (near worst-case probabilities), instead of deriving confidence on mean values. We approach this subject by considering the random nature of simulation output data. We will start by discussing well known approaches for estimating distributions out of simulation output, and the confidence which can be applied to its mean values. This is the basis for a discussion on the applicability of such approaches to derive confidence on the tail of distributions, where the worst-case is expected to be.
Resumo:
The principal topic of this work is the application of data mining techniques, in particular of machine learning, to the discovery of knowledge in a protein database. In the first chapter a general background is presented. Namely, in section 1.1 we overview the methodology of a Data Mining project and its main algorithms. In section 1.2 an introduction to the proteins and its supporting file formats is outlined. This chapter is concluded with section 1.3 which defines that main problem we pretend to address with this work: determine if an amino acid is exposed or buried in a protein, in a discrete way (i.e.: not continuous), for five exposition levels: 2%, 10%, 20%, 25% and 30%. In the second chapter, following closely the CRISP-DM methodology, whole the process of construction the database that supported this work is presented. Namely, it is described the process of loading data from the Protein Data Bank, DSSP and SCOP. Then an initial data exploration is performed and a simple prediction model (baseline) of the relative solvent accessibility of an amino acid is introduced. It is also introduced the Data Mining Table Creator, a program developed to produce the data mining tables required for this problem. In the third chapter the results obtained are analyzed with statistical significance tests. Initially the several used classifiers (Neural Networks, C5.0, CART and Chaid) are compared and it is concluded that C5.0 is the most suitable for the problem at stake. It is also compared the influence of parameters like the amino acid information level, the amino acid window size and the SCOP class type in the accuracy of the predictive models. The fourth chapter starts with a brief revision of the literature about amino acid relative solvent accessibility. Then, we overview the main results achieved and finally discuss about possible future work. The fifth and last chapter consists of appendices. Appendix A has the schema of the database that supported this thesis. Appendix B has a set of tables with additional information. Appendix C describes the software provided in the DVD accompanying this thesis that allows the reconstruction of the present work.
Resumo:
This Thesis describes the application of automatic learning methods for a) the classification of organic and metabolic reactions, and b) the mapping of Potential Energy Surfaces(PES). The classification of reactions was approached with two distinct methodologies: a representation of chemical reactions based on NMR data, and a representation of chemical reactions from the reaction equation based on the physico-chemical and topological features of chemical bonds. NMR-based classification of photochemical and enzymatic reactions. Photochemical and metabolic reactions were classified by Kohonen Self-Organizing Maps (Kohonen SOMs) and Random Forests (RFs) taking as input the difference between the 1H NMR spectra of the products and the reactants. The development of such a representation can be applied in automatic analysis of changes in the 1H NMR spectrum of a mixture and their interpretation in terms of the chemical reactions taking place. Examples of possible applications are the monitoring of reaction processes, evaluation of the stability of chemicals, or even the interpretation of metabonomic data. A Kohonen SOM trained with a data set of metabolic reactions catalysed by transferases was able to correctly classify 75% of an independent test set in terms of the EC number subclass. Random Forests improved the correct predictions to 79%. With photochemical reactions classified into 7 groups, an independent test set was classified with 86-93% accuracy. The data set of photochemical reactions was also used to simulate mixtures with two reactions occurring simultaneously. Kohonen SOMs and Feed-Forward Neural Networks (FFNNs) were trained to classify the reactions occurring in a mixture based on the 1H NMR spectra of the products and reactants. Kohonen SOMs allowed the correct assignment of 53-63% of the mixtures (in a test set). Counter-Propagation Neural Networks (CPNNs) gave origin to similar results. The use of supervised learning techniques allowed an improvement in the results. They were improved to 77% of correct assignments when an ensemble of ten FFNNs were used and to 80% when Random Forests were used. This study was performed with NMR data simulated from the molecular structure by the SPINUS program. In the design of one test set, simulated data was combined with experimental data. The results support the proposal of linking databases of chemical reactions to experimental or simulated NMR data for automatic classification of reactions and mixtures of reactions. Genome-scale classification of enzymatic reactions from their reaction equation. The MOLMAP descriptor relies on a Kohonen SOM that defines types of bonds on the basis of their physico-chemical and topological properties. The MOLMAP descriptor of a molecule represents the types of bonds available in that molecule. The MOLMAP descriptor of a reaction is defined as the difference between the MOLMAPs of the products and the reactants, and numerically encodes the pattern of bonds that are broken, changed, and made during a chemical reaction. The automatic perception of chemical similarities between metabolic reactions is required for a variety of applications ranging from the computer validation of classification systems, genome-scale reconstruction (or comparison) of metabolic pathways, to the classification of enzymatic mechanisms. Catalytic functions of proteins are generally described by the EC numbers that are simultaneously employed as identifiers of reactions, enzymes, and enzyme genes, thus linking metabolic and genomic information. Different methods should be available to automatically compare metabolic reactions and for the automatic assignment of EC numbers to reactions still not officially classified. In this study, the genome-scale data set of enzymatic reactions available in the KEGG database was encoded by the MOLMAP descriptors, and was submitted to Kohonen SOMs to compare the resulting map with the official EC number classification, to explore the possibility of predicting EC numbers from the reaction equation, and to assess the internal consistency of the EC classification at the class level. A general agreement with the EC classification was observed, i.e. a relationship between the similarity of MOLMAPs and the similarity of EC numbers. At the same time, MOLMAPs were able to discriminate between EC sub-subclasses. EC numbers could be assigned at the class, subclass, and sub-subclass levels with accuracies up to 92%, 80%, and 70% for independent test sets. The correspondence between chemical similarity of metabolic reactions and their MOLMAP descriptors was applied to the identification of a number of reactions mapped into the same neuron but belonging to different EC classes, which demonstrated the ability of the MOLMAP/SOM approach to verify the internal consistency of classifications in databases of metabolic reactions. RFs were also used to assign the four levels of the EC hierarchy from the reaction equation. EC numbers were correctly assigned in 95%, 90%, 85% and 86% of the cases (for independent test sets) at the class, subclass, sub-subclass and full EC number level,respectively. Experiments for the classification of reactions from the main reactants and products were performed with RFs - EC numbers were assigned at the class, subclass and sub-subclass level with accuracies of 78%, 74% and 63%, respectively. In the course of the experiments with metabolic reactions we suggested that the MOLMAP / SOM concept could be extended to the representation of other levels of metabolic information such as metabolic pathways. Following the MOLMAP idea, the pattern of neurons activated by the reactions of a metabolic pathway is a representation of the reactions involved in that pathway - a descriptor of the metabolic pathway. This reasoning enabled the comparison of different pathways, the automatic classification of pathways, and a classification of organisms based on their biochemical machinery. The three levels of classification (from bonds to metabolic pathways) allowed to map and perceive chemical similarities between metabolic pathways even for pathways of different types of metabolism and pathways that do not share similarities in terms of EC numbers. Mapping of PES by neural networks (NNs). In a first series of experiments, ensembles of Feed-Forward NNs (EnsFFNNs) and Associative Neural Networks (ASNNs) were trained to reproduce PES represented by the Lennard-Jones (LJ) analytical potential function. The accuracy of the method was assessed by comparing the results of molecular dynamics simulations (thermal, structural, and dynamic properties) obtained from the NNs-PES and from the LJ function. The results indicated that for LJ-type potentials, NNs can be trained to generate accurate PES to be used in molecular simulations. EnsFFNNs and ASNNs gave better results than single FFNNs. A remarkable ability of the NNs models to interpolate between distant curves and accurately reproduce potentials to be used in molecular simulations is shown. The purpose of the first study was to systematically analyse the accuracy of different NNs. Our main motivation, however, is reflected in the next study: the mapping of multidimensional PES by NNs to simulate, by Molecular Dynamics or Monte Carlo, the adsorption and self-assembly of solvated organic molecules on noble-metal electrodes. Indeed, for such complex and heterogeneous systems the development of suitable analytical functions that fit quantum mechanical interaction energies is a non-trivial or even impossible task. The data consisted of energy values, from Density Functional Theory (DFT) calculations, at different distances, for several molecular orientations and three electrode adsorption sites. The results indicate that NNs require a data set large enough to cover well the diversity of possible interaction sites, distances, and orientations. NNs trained with such data sets can perform equally well or even better than analytical functions. Therefore, they can be used in molecular simulations, particularly for the ethanol/Au (111) interface which is the case studied in the present Thesis. Once properly trained, the networks are able to produce, as output, any required number of energy points for accurate interpolations.
Resumo:
Dissertação apresentada na Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa para a obtenção do grau de Mestre em Engenharia do Ambiente
Resumo:
Feature selection is a central problem in machine learning and pattern recognition. On large datasets (in terms of dimension and/or number of instances), using search-based or wrapper techniques can be cornputationally prohibitive. Moreover, many filter methods based on relevance/redundancy assessment also take a prohibitively long time on high-dimensional. datasets. In this paper, we propose efficient unsupervised and supervised feature selection/ranking filters for high-dimensional datasets. These methods use low-complexity relevance and redundancy criteria, applicable to supervised, semi-supervised, and unsupervised learning, being able to act as pre-processors for computationally intensive methods to focus their attention on smaller subsets of promising features. The experimental results, with up to 10(5) features, show the time efficiency of our methods, with lower generalization error than state-of-the-art techniques, while being dramatically simpler and faster.
Resumo:
Controlled fires in forest areas are frequently used in most Mediterranean countries as a preventive technique to avoid severe wildfires in summer season. In Portugal, this forest management method of fuel mass availability is also used and has shown to be beneficial as annual statistical reports confirm that the decrease of wildfires occurrence have a direct relationship with the controlled fire practice. However prescribed fire can have serious side effects in some forest soil properties. This work shows the changes that occurred in some forest soils properties after a prescribed fire action. The experiments were carried out in soil cover over a natural site of Andaluzitic schist, in Gramelas, Caminha, Portugal, that had not been burn for four years. The composed soil samples were collected from five plots at three different layers (0-3cm, 3-6cm and 6-18cm) during a three-year monitoring period after the prescribed burning. Principal Component Analysis was used to reach the presented conclusions.
Resumo:
This study aims to optimize the water quality monitoring of a polluted watercourse (Leça River, Portugal) through the principal component analysis (PCA) and cluster analysis (CA). These statistical methodologies were applied to physicochemical, bacteriological and ecotoxicological data (with the marine bacterium Vibrio fischeri and the green alga Chlorella vulgaris) obtained with the analysis of water samples monthly collected at seven monitoring sites and during five campaigns (February, May, June, August, and September 2006). The results of some variables were assigned to water quality classes according to national guidelines. Chemical and bacteriological quality data led to classify Leça River water quality as “bad” or “very bad”. PCA and CA identified monitoring sites with similar pollution pattern, giving to site 1 (located in the upstream stretch of the river) a distinct feature from all other sampling sites downstream. Ecotoxicity results corroborated this classification thus revealing differences in space and time. The present study includes not only physical, chemical and bacteriological but also ecotoxicological parameters, which broadens new perspectives in river water characterization. Moreover, the application of PCA and CA is very useful to optimize water quality monitoring networks, defining the minimum number of sites and their location. Thus, these tools can support appropriate management decisions.