951 resultados para Spatial Data Quality
Resumo:
The clinical content of administrative databases includes, among others, patient demographic characteristics, and codes for diagnoses and procedures. The data in these databases is standardized, clearly defined, readily available, less expensive than collected by other means, and normally covers hospitalizations in entire geographic areas. Although with some limitations, this data is often used to evaluate the quality of healthcare. Under these circumstances, the quality of the data, for instance, errors, or it completeness, is of central importance and should never be ignored. Both the minimization of data quality problems and a deep knowledge about this data (e.g., how to select a patient group) are important for users in order to trust and to correctly interpret results. In this paper we present, discuss and give some recommendations for some problems found in these administrative databases. We also present a simple tool that can be used to screen the quality of data through the use of domain specific data quality indicators. These indicators can significantly contribute to better data, to give steps towards a continuous increase of data quality and, certainly, to better informed decision-making.
Resumo:
Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies.
Resumo:
Dissertation submitted in partial fulfillment of the requirements for the Degree of Master of Science in Geospatial Technologies.
Resumo:
The main objective of this survey was to perform descriptive analysis of crime evolution in Portugal between 1995 and 2013. The main focus of this survey was to analyse spatial crime evolution patterns in Portuguese NUTS III regions. Most important crime types have been included into analysis. The main idea was to uncover relation between local patterns and global crime evolution; to define regions which have contributed to global crime evolution of some specific crime types and to define how they have contributed. There were many statistical reports and scientific papers which have analysed some particular crime types, but one global spatial-temporal analysis has not been found. Principal Component Analysis and multidimensional descriptive data analysis technique STATIS have been the base of the analysis. The results of this survey has shown that strong spatial and temporal crime patterns exist. It was possible to describe global crime evolution patterns and to define crime evolution patterns in NUTS III regions. It was possible to define three to four groups of crimes where each group shows similar spatial crime dynamics.
Advanced mapping of environmental data: Geostatistics, Machine Learning and Bayesian Maximum Entropy
Resumo:
This book combines geostatistics and global mapping systems to present an up-to-the-minute study of environmental data. Featuring numerous case studies, the reference covers model dependent (geostatistics) and data driven (machine learning algorithms) analysis techniques such as risk mapping, conditional stochastic simulations, descriptions of spatial uncertainty and variability, artificial neural networks (ANN) for spatial data, Bayesian maximum entropy (BME), and more.
Resumo:
This paper deals with the problem of spatial data mapping. A new method based on wavelet interpolation and geostatistical prediction (kriging) is proposed. The method - wavelet analysis residual kriging (WARK) - is developed in order to assess the problems rising for highly variable data in presence of spatial trends. In these cases stationary prediction models have very limited application. Wavelet analysis is used to model large-scale structures and kriging of the remaining residuals focuses on small-scale peculiarities. WARK is able to model spatial pattern which features multiscale structure. In the present work WARK is applied to the rainfall data and the results of validation are compared with the ones obtained from neural network residual kriging (NNRK). NNRK is also a residual-based method, which uses artificial neural network to model large-scale non-linear trends. The comparison of the results demonstrates the high quality performance of WARK in predicting hot spots, reproducing global statistical characteristics of the distribution and spatial correlation structure.
Resumo:
Many regions of the world, including inland lakes, present with suboptimal conditions for the remotely sensed retrieval of optical signals, thus challenging the limits of available satellite data-processing tools, such as atmospheric correction models (ACM) and water constituent-retrieval (WCR) algorithms. Working in such regions, however, can improve our understanding of remote-sensing tools and their applicabil- ity in new contexts, in addition to potentially offering useful information about aquatic ecology. Here, we assess and compare 32 combinations of two ACMs, two WCRs, and three binary categories of data quality standards to optimize a remotely sensed proxy of plankton biomass in Lake Kivu. Each parameter set is compared against the available ground-truth match-ups using Spearman's right-tailed ρ. Focusing on the best sets from each ACM-WCR combination, their performances are discussed with regard to data distribution, sample size, spatial completeness, and seasonality. The results of this study may be of interest both for ecological studies on Lake Kivu and for epidemio- logical studies of disease, such as cholera, the dynamics of which has been associated with plankton biomass in other regions of the world.
Resumo:
Polistine wasps are important in Neotropical ecosystems due to their ubiquity and diversity. Inventories have not adequately considered spatial attributes of collected specimens. Spatial data on biodiversity are important for study and mitigation of anthropogenic impacts over natural ecosystems and for protecting species. We described and analyzed local-scale spatial patterns of collecting records of wasp species, as well as spatial variation of diversity descriptors in a 2500-hectare area of an Amazon forest in Brazil. Rare species comprised the largest fraction of the fauna. Close range spatial effects were detected for most of the more common species, with clustering of presence-data at short distances. Larger spatial lag effects could also be identified in some species, constituting probably cases of exogenous autocorrelation and candidates for explanations based on environmental factors. In a few cases, significant or near significant correlations were found between five species (of Agelaia, Angiopolybia, and Mischocyttarus) and three studied environmental variables: distance to nearest stream, terrain altitude, and the type of forest canopy. However, association between these factors and biodiversity variables were generally low. When used as predictors of polistine richness in a linear multiple regression, only the coefficient for the forest canopy variable resulted significant. Some level of prediction of wasp diversity variables can be attained based on environmental variables, especially vegetation structure. Large-scale landscape and regional studies should be scheduled to address this issue.
Resumo:
The research considers the problem of spatial data classification using machine learning algorithms: probabilistic neural networks (PNN) and support vector machines (SVM). As a benchmark model simple k-nearest neighbor algorithm is considered. PNN is a neural network reformulation of well known nonparametric principles of probability density modeling using kernel density estimator and Bayesian optimal or maximum a posteriori decision rules. PNN is well suited to problems where not only predictions but also quantification of accuracy and integration of prior information are necessary. An important property of PNN is that they can be easily used in decision support systems dealing with problems of automatic classification. Support vector machine is an implementation of the principles of statistical learning theory for the classification tasks. Recently they were successfully applied for different environmental topics: classification of soil types and hydro-geological units, optimization of monitoring networks, susceptibility mapping of natural hazards. In the present paper both simulated and real data case studies (low and high dimensional) are considered. The main attention is paid to the detection and learning of spatial patterns by the algorithms applied.
Resumo:
Résumé Cette thèse est consacrée à l'analyse, la modélisation et la visualisation de données environnementales à référence spatiale à l'aide d'algorithmes d'apprentissage automatique (Machine Learning). L'apprentissage automatique peut être considéré au sens large comme une sous-catégorie de l'intelligence artificielle qui concerne particulièrement le développement de techniques et d'algorithmes permettant à une machine d'apprendre à partir de données. Dans cette thèse, les algorithmes d'apprentissage automatique sont adaptés pour être appliqués à des données environnementales et à la prédiction spatiale. Pourquoi l'apprentissage automatique ? Parce que la majorité des algorithmes d'apprentissage automatiques sont universels, adaptatifs, non-linéaires, robustes et efficaces pour la modélisation. Ils peuvent résoudre des problèmes de classification, de régression et de modélisation de densité de probabilités dans des espaces à haute dimension, composés de variables informatives spatialisées (« géo-features ») en plus des coordonnées géographiques. De plus, ils sont idéaux pour être implémentés en tant qu'outils d'aide à la décision pour des questions environnementales allant de la reconnaissance de pattern à la modélisation et la prédiction en passant par la cartographie automatique. Leur efficacité est comparable au modèles géostatistiques dans l'espace des coordonnées géographiques, mais ils sont indispensables pour des données à hautes dimensions incluant des géo-features. Les algorithmes d'apprentissage automatique les plus importants et les plus populaires sont présentés théoriquement et implémentés sous forme de logiciels pour les sciences environnementales. Les principaux algorithmes décrits sont le Perceptron multicouches (MultiLayer Perceptron, MLP) - l'algorithme le plus connu dans l'intelligence artificielle, le réseau de neurones de régression généralisée (General Regression Neural Networks, GRNN), le réseau de neurones probabiliste (Probabilistic Neural Networks, PNN), les cartes auto-organisées (SelfOrganized Maps, SOM), les modèles à mixture Gaussiennes (Gaussian Mixture Models, GMM), les réseaux à fonctions de base radiales (Radial Basis Functions Networks, RBF) et les réseaux à mixture de densité (Mixture Density Networks, MDN). Cette gamme d'algorithmes permet de couvrir des tâches variées telle que la classification, la régression ou l'estimation de densité de probabilité. L'analyse exploratoire des données (Exploratory Data Analysis, EDA) est le premier pas de toute analyse de données. Dans cette thèse les concepts d'analyse exploratoire de données spatiales (Exploratory Spatial Data Analysis, ESDA) sont traités selon l'approche traditionnelle de la géostatistique avec la variographie expérimentale et selon les principes de l'apprentissage automatique. La variographie expérimentale, qui étudie les relations entre pairs de points, est un outil de base pour l'analyse géostatistique de corrélations spatiales anisotropiques qui permet de détecter la présence de patterns spatiaux descriptible par une statistique. L'approche de l'apprentissage automatique pour l'ESDA est présentée à travers l'application de la méthode des k plus proches voisins qui est très simple et possède d'excellentes qualités d'interprétation et de visualisation. Une part importante de la thèse traite de sujets d'actualité comme la cartographie automatique de données spatiales. Le réseau de neurones de régression généralisée est proposé pour résoudre cette tâche efficacement. Les performances du GRNN sont démontrées par des données de Comparaison d'Interpolation Spatiale (SIC) de 2004 pour lesquelles le GRNN bat significativement toutes les autres méthodes, particulièrement lors de situations d'urgence. La thèse est composée de quatre chapitres : théorie, applications, outils logiciels et des exemples guidés. Une partie importante du travail consiste en une collection de logiciels : Machine Learning Office. Cette collection de logiciels a été développée durant les 15 dernières années et a été utilisée pour l'enseignement de nombreux cours, dont des workshops internationaux en Chine, France, Italie, Irlande et Suisse ainsi que dans des projets de recherche fondamentaux et appliqués. Les cas d'études considérés couvrent un vaste spectre de problèmes géoenvironnementaux réels à basse et haute dimensionnalité, tels que la pollution de l'air, du sol et de l'eau par des produits radioactifs et des métaux lourds, la classification de types de sols et d'unités hydrogéologiques, la cartographie des incertitudes pour l'aide à la décision et l'estimation de risques naturels (glissements de terrain, avalanches). Des outils complémentaires pour l'analyse exploratoire des données et la visualisation ont également été développés en prenant soin de créer une interface conviviale et facile à l'utilisation. Machine Learning for geospatial data: algorithms, software tools and case studies Abstract The thesis is devoted to the analysis, modeling and visualisation of spatial environmental data using machine learning algorithms. In a broad sense machine learning can be considered as a subfield of artificial intelligence. It mainly concerns with the development of techniques and algorithms that allow computers to learn from data. In this thesis machine learning algorithms are adapted to learn from spatial environmental data and to make spatial predictions. Why machine learning? In few words most of machine learning algorithms are universal, adaptive, nonlinear, robust and efficient modeling tools. They can find solutions for the classification, regression, and probability density modeling problems in high-dimensional geo-feature spaces, composed of geographical space and additional relevant spatially referenced features. They are well-suited to be implemented as predictive engines in decision support systems, for the purposes of environmental data mining including pattern recognition, modeling and predictions as well as automatic data mapping. They have competitive efficiency to the geostatistical models in low dimensional geographical spaces but are indispensable in high-dimensional geo-feature spaces. The most important and popular machine learning algorithms and models interesting for geo- and environmental sciences are presented in details: from theoretical description of the concepts to the software implementation. The main algorithms and models considered are the following: multi-layer perceptron (a workhorse of machine learning), general regression neural networks, probabilistic neural networks, self-organising (Kohonen) maps, Gaussian mixture models, radial basis functions networks, mixture density networks. This set of models covers machine learning tasks such as classification, regression, and density estimation. Exploratory data analysis (EDA) is initial and very important part of data analysis. In this thesis the concepts of exploratory spatial data analysis (ESDA) is considered using both traditional geostatistical approach such as_experimental variography and machine learning. Experimental variography is a basic tool for geostatistical analysis of anisotropic spatial correlations which helps to understand the presence of spatial patterns, at least described by two-point statistics. A machine learning approach for ESDA is presented by applying the k-nearest neighbors (k-NN) method which is simple and has very good interpretation and visualization properties. Important part of the thesis deals with a hot topic of nowadays, namely, an automatic mapping of geospatial data. General regression neural networks (GRNN) is proposed as efficient model to solve this task. Performance of the GRNN model is demonstrated on Spatial Interpolation Comparison (SIC) 2004 data where GRNN model significantly outperformed all other approaches, especially in case of emergency conditions. The thesis consists of four chapters and has the following structure: theory, applications, software tools, and how-to-do-it examples. An important part of the work is a collection of software tools - Machine Learning Office. Machine Learning Office tools were developed during last 15 years and was used both for many teaching courses, including international workshops in China, France, Italy, Ireland, Switzerland and for realizing fundamental and applied research projects. Case studies considered cover wide spectrum of the real-life low and high-dimensional geo- and environmental problems, such as air, soil and water pollution by radionuclides and heavy metals, soil types and hydro-geological units classification, decision-oriented mapping with uncertainties, natural hazards (landslides, avalanches) assessments and susceptibility mapping. Complementary tools useful for the exploratory data analysis and visualisation were developed as well. The software is user friendly and easy to use.
Resumo:
Global positioning systems (GPS) offer a cost-effective and efficient method to input and update transportation data. The spatial location of objects provided by GPS is easily integrated into geographic information systems (GIS). The storage, manipulation, and analysis of spatial data are also relatively simple in a GIS. However, many data storage and reporting methods at transportation agencies rely on linear referencing methods (LRMs); consequently, GPS data must be able to link with linear referencing. Unfortunately, the two systems are fundamentally incompatible in the way data are collected, integrated, and manipulated. In order for the spatial data collected using GPS to be integrated into a linear referencing system or shared among LRMs, a number of issues need to be addressed. This report documents and evaluates several of those issues and offers recommendations. In order to evaluate the issues associated with integrating GPS data with a LRM, a pilot study was created. To perform the pilot study, point features, a linear datum, and a spatial representation of a LRM were created for six test roadway segments that were located within the boundaries of the pilot study conducted by the Iowa Department of Transportation linear referencing system project team. Various issues in integrating point features with a LRM or between LRMs are discussed and recommendations provided. The accuracy of the GPS is discussed, including issues such as point features mapping to the wrong segment. Another topic is the loss of spatial information that occurs when a three-dimensional or two-dimensional spatial point feature is converted to a one-dimensional representation on a LRM. Recommendations such as storing point features as spatial objects if necessary or preserving information such as coordinates and elevation are suggested. The lack of spatial accuracy characteristic of most cartography, on which LRM are often based, is another topic discussed. The associated issues include linear and horizontal offset error. The final topic discussed is some of the issues in transferring point feature data between LRMs.
Resumo:
Especially in panel surveys, respondent attrition, respondent learning, and interviewer experience effects play a crucial role with respect to data quality. We examine three interview survey quality indicators in the same survey in a cross sectional as well as in a longitudinal way. In the cross sectional analysis we compare data quality in the mature original sample with that in a refreshment sample, surveyed in the same wave. Because in the same wave an interviewer survey was conducted, collecting attitudes on their socio demography, survey attitudes and burden measures, we are able to consider interviewer fixed effects as well. The longitudinal analysis gives more insight in the respondent learning effects with respect to the quality indicators considered by considering the very same respondents across waves. The Swiss Household Panel, a CATI survey representative of the Swiss residential population, forms an ideal modelling database: the interviewer - respondent assignment is random, both within and across waves. This design avoids possible confusion with other effects stemming from a non-random assignment of interviewers, e.g. area effects or effects from assigning the best interviewers to the hard cases. In order to separate interviewer, respondent and wave effects, we build cross-classified multilevel models.
Resumo:
Työn tavoittena oli selvittää, miten tietovarastointi voi tukea yrityksessä tapahtuvaa päätöksentekoa. Tietovarastokomponenttien ja –prosessien kuvauksen jälkeen on käsitelty tietovarastoprojektin eri vaiheita. Esitettyä teoriaa sovellettiin käytäntöön globaalissa metalliteollisuusyrityksessä, jossa tietovarastointikonseptia testattiin. Testauksen perusteella arvioitiin olemassa olevan tiedon tilaa sekä kahden käytetyn ohjelmiston toimivuutta tietovarastoinnissa. Yrityksen operatiivisten järjestelmien tiedon laadun todettiin olevan tutkituilta osin epäyhtenäistä ja puutteellista. Siksi tiedon suora yrityslaajuinen hyödyntäminen luotettavien ja hyvälaatuisten raporttien luonnissa on vaikeaa. Lisäksi eri yksiköiden välillä havaittiin epäyhtenäisyyttä käytettyjen liiketoiminnan käsitteiden sekä järjestelmien käyttötapojen suhteen. Testauksessa käytetyt ohjelmistot suoriutuivat perustietovarastoinnista hyvin, vaikkakin joitain rajoituksia ja erikoisuuksia ilmenikin. Työtä voidaan pitää ennen varsinaista tietovarastoprojektia tehtävänä esitutkimuksena. Jatkotoimenpiteinä ehdotetaan testauksen jatkamista nykyisillä työkaluilla kohdistaen tavoitteet konkreettisiin tuloksiin. Tiedon laadun tärkeyttä tulee korostaa koko organisaatiossa ja olemassa olevan tiedon laatua pitää parantaa tulevaisuudessa.
Resumo:
The enhanced functional sensitivity offered by ultra-high field imaging may significantly benefit simultaneous EEG-fMRI studies, but the concurrent increases in artifact contamination can strongly compromise EEG data quality. In the present study, we focus on EEG artifacts created by head motion in the static B0 field. A novel approach for motion artifact detection is proposed, based on a simple modification of a commercial EEG cap, in which four electrodes are non-permanently adapted to record only magnetic induction effects. Simultaneous EEG-fMRI data were acquired with this setup, at 7T, from healthy volunteers undergoing a reversing-checkerboard visual stimulation paradigm. Data analysis assisted by the motion sensors revealed that, after gradient artifact correction, EEG signal variance was largely dominated by pulse artifacts (81-93%), but contributions from spontaneous motion (4-13%) were still comparable to or even larger than those of actual neuronal activity (3-9%). Multiple approaches were tested to determine the most effective procedure for denoising EEG data incorporating motion sensor information. Optimal results were obtained by applying an initial pulse artifact correction step (AAS-based), followed by motion artifact correction (based on the motion sensors) and ICA denoising. On average, motion artifact correction (after AAS) yielded a 61% reduction in signal power and a 62% increase in VEP trial-by-trial consistency. Combined with ICA, these improvements rose to a 74% power reduction and an 86% increase in trial consistency. Overall, the improvements achieved were well appreciable at single-subject and single-trial levels, and set an encouraging quality mark for simultaneous EEG-fMRI at ultra-high field.
Resumo:
Coastal birds are an integral part of coastal ecosystems, which nowadays are subject to severe environmental pressures. Effective measures for the management and conservation of seabirds and their habitats call for insight into their population processes and the factors affecting their distribution and abundance. Central to national and international management and conservation measures is the availability of accurate data and information on bird populations, as well as on environmental trends and on measures taken to solve environmental problems. In this thesis I address different aspects of the occurrence, abundance, population trends and breeding success of waterbirds breeding on the Finnish coast of the Baltic Sea, and discuss the implications of the results for seabird monitoring, management and conservation. In addition, I assess the position and prospects of coastal bird monitoring data, in the processing and dissemination of biodiversity data and information in accordance with the Convention on Biological Diversity (CBD) and other national and international commitments. I show that important factors for seabird habitat selection are island area and elevation, water depth, shore openness, and the composition of island cover habitats. Habitat preferences are species-specific, with certain similarities within species groups. The occurrence of the colonial Arctic Tern (Sterna paradisaea) is partly affected by different habitat characteristics than its abundance. Using long-term bird monitoring data, I show that eutrophication and winter severity have reduced the populations of several Finnish seabird species. A major demographic factor through which environmental changes influence bird populations is breeding success. Breeding success can function as a more rapid indicator of sublethal environmental impacts than population trends, particularly for long-lived and slowbreeding species, and should therefore be included in coastal bird monitoring schemes. Among my target species, local breeding success can be shown to affect the populations of the Mallard (Anas platyrhynchos), the Eider (Somateria mollissima) and the Goosander (Mergus merganser) after a time lag corresponding to their species-specific recruitment age. For some of the target species, the number of individuals in late summer can be used as an easier and more cost-effective indicator of breeding success than brood counts. My results highlight that the interpretation and application of habitat and population studies require solid background knowledge of the ecology of the target species. In addition, the special characteristics of coastal birds, their habitats, and coastal bird monitoring data have to be considered in the assessment of their distribution and population trends. According to the results, the relationships between the occurrence, abundance and population trends of coastal birds and environmental factors can be quantitatively assessed using multivariate modelling and model selection. Spatial data sets widely available in Finland can be utilised in the calculation of several variables that are relevant to the habitat selection of Finnish coastal species. Concerning some habitat characteristics field work is still required, due to a lack of remotely sensed data or the low resolution of readily available data in relation to the fine scale of the habitat patches in the archipelago. While long-term data sets exist for water quality and weather, the lack of data concerning for instance the food resources of birds hampers more detailed studies of environmental effects on bird populations. Intensive studies of coastal bird species in different archipelago areas should be encouraged. The provision and free delivery of high-quality coastal data concerning bird populations and their habitats would greatly increase the capability of ecological modelling, as well as the management and conservation of coastal environments and communities. International initiatives that promote open spatial data infrastructures and sharing are therefore highly regarded. To function effectively, international information networks, such as the biodiversity Clearing House Mechanism (CHM) under the CBD, need to be rooted at regional and local levels. Attention should also be paid to the processing of data for higher levels of the information hierarchy, so that data are synthesized and developed into high-quality knowledge applicable to management and conservation.