848 resultados para Spatial data mining
Resumo:
The primary challenge in groundwater and contaminant transport modeling is obtaining the data needed for constructing, calibrating and testing the models. Large amounts of data are necessary for describing the hydrostratigraphy in areas with complex geology. Increasingly states are making spatial data available that can be used for input to groundwater flow models. The appropriateness of this data for large-scale flow systems has not been tested. This study focuses on modeling a plume of 1,4-dioxane in a heterogeneous aquifer system in Scio Township, Washtenaw County, Michigan. The analysis consisted of: (1) characterization of hydrogeology of the area and construction of a conceptual model based on publicly available spatial data, (2) development and calibration of a regional flow model for the site, (3) conversion of the regional model to a more highly resolved local model, (4) simulation of the dioxane plume, and (5) evaluation of the model's ability to simulate field data and estimation of the possible dioxane sources and subsequent migration until maximum concentrations are at or below the Michigan Department of Environmental Quality's residential cleanup standard for groundwater (85 ppb). MODFLOW-2000 and MT3D programs were utilized to simulate the groundwater flow and the development and movement of the 1, 4-dioxane plume, respectively. MODFLOW simulates transient groundwater flow in a quasi-3-dimensional sense, subject to a variety of boundary conditions that can simulate recharge, pumping, and surface-/groundwater interactions. MT3D simulates solute advection with groundwater flow (using the flow solution from MODFLOW), dispersion, source/sink mixing, and chemical reaction of contaminants. This modeling approach was successful at simulating the groundwater flows by calibrating recharge and hydraulic conductivities. The plume transport was adequately simulated using literature dispersivity and sorption coefficients, although the plume geometries were not well constrained.
Resumo:
Smart homes for the aging population have recently started attracting the attention of the research community. The "health state" of smart homes is comprised of many different levels; starting with the physical health of citizens, it also includes longer-term health norms and outcomes, as well as the arena of positive behavior changes. One of the problems of interest is to monitor the activities of daily living (ADL) of the elderly, aiming at their protection and well-being. For this purpose, we installed passive infrared (PIR) sensors to detect motion in a specific area inside a smart apartment and used them to collect a set of ADL. In a novel approach, we describe a technology that allows the ground truth collected in one smart home to train activity recognition systems for other smart homes. We asked the users to label all instances of all ADL only once and subsequently applied data mining techniques to cluster in-home sensor firings. Each cluster would therefore represent the instances of the same activity. Once the clusters were associated to their corresponding activities, our system was able to recognize future activities. To improve the activity recognition accuracy, our system preprocessed raw sensor data by identifying overlapping activities. To evaluate the recognition performance from a 200-day dataset, we implemented three different active learning classification algorithms and compared their performance: naive Bayesian (NB), support vector machine (SVM) and random forest (RF). Based on our results, the RF classifier recognized activities with an average specificity of 96.53%, a sensitivity of 68.49%, a precision of 74.41% and an F-measure of 71.33%, outperforming both the NB and SVM classifiers. Further clustering markedly improved the results of the RF classifier. An activity recognition system based on PIR sensors in conjunction with a clustering classification approach was able to detect ADL from datasets collected from different homes. Thus, our PIR-based smart home technology could improve care and provide valuable information to better understand the functioning of our societies, as well as to inform both individual and collective action in a smart city scenario.
Resumo:
In land systems, equitably managing trade-offs between planetary boundaries and human development needs represents a grand challenge in sustainability oriented initiatives. Informing such initiatives requires knowledge about the nexus between land use, poverty, and environment. This paper presents results from Lao PDR, where we combined nationwide spatial data on land use types and the environmental state of landscapes with village-level poverty indicators. Our analysis reveals two general but contrasting trends. First, landscapes with paddy or permanent agriculture allow a greater number of people to live in less poverty but come at the price of a decrease in natural vegetation cover. Second, people practising extensive swidden agriculture and living in intact environments are often better off than people in degraded paddy or permanent agriculture. As poverty rates within different landscape types vary more than between landscape types, we cannot stipulate a land use–poverty–environment nexus. However, the distinct spatial patterns or configurations of these rates point to other important factors at play. Drawing on ethnicity as a proximate factor for endogenous development potentials and accessibility as a proximate factor for external influences, we further explore these linkages. Ethnicity is strongly related to poverty in all land use types almost independently of accessibility, implying that social distance outweighs geographic or physical distance. In turn, accessibility, almost a precondition for poverty alleviation, is mainly beneficial to ethnic majority groups and people living in paddy or permanent agriculture. These groups are able to translate improved accessibility into poverty alleviation. Our results show that the concurrence of external influences with local—highly contextual—development potentials is key to shaping outcomes of the land use–poverty–environment nexus. By addressing such leverage points, these findings help guide more effective development interventions. At the same time, they point to the need in land change science to better integrate the understanding of place-based land indicators with process-based drivers of land use change.
Resumo:
A wide variety of spatial data collection efforts are ongoing throughout local, state and federal agencies, private firms and non-profit organizations. Each effort is established for a different purpose but organizations and individuals often collect and maintain the same or similar information. The United States federal government has undertaken many initiatives such as the National Spatial Data Infrastructure, the National Map and Geospatial One-Stop to reduce duplicative spatial data collection and promote the coordinated use, sharing, and dissemination of spatial data nationwide. A key premise in most of these initiatives is that no national government will be able to gather and maintain more than a small percentage of the geographic data that users want and desire. Thus, national initiatives depend typically on the cooperation of those already gathering spatial data and those using GIs to meet specific needs to help construct and maintain these spatial data infrastructures and geo-libraries for their nations (Onsrud 2001). Some of the impediments to widespread spatial data sharing are well known from directly asking GIs data producers why they are not currently involved in creating datasets that are of common or compatible formats, documenting their datasets in a standardized metadata format or making their datasets more readily available to others through Data Clearinghouses or geo-libraries. The research described in this thesis addresses the impediments to wide-scale spatial data sharing faced by GIs data producers and explores a new conceptual data-sharing approach, the Public Commons for Geospatial Data, that supports user-friendly metadata creation, open access licenses, archival services and documentation of parent lineage of the contributors and value- adders of digital spatial data sets.
Resumo:
This poster raises the issue of a research work oriented to the storage, retrieval, representation and analysis of dynamic GI, taking into account The ultimate objective is the modelling and representation of the dynamic nature of geographic features, establishing mechanisms to store geometries enriched with a temporal structure (regardless of space) and a set of semantic descriptors detailing and clarifying the nature of the represented features and their temporality. the semantic, the temporal and the spatiotemporal components. We intend to define a set of methods, rules and restrictions for the adequate integration of these components into the primary elements of the GI: theme, location, time [1]. We intend to establish and incorporate three new structures (layers) into the core of data storage by using mark-up languages: a semantictemporal structure, a geosemantic structure, and an incremental spatiotemporal structure. Thus, data would be provided with the capability of pinpointing and expressing their own basic and temporal characteristics, enabling them to interact each other according to their context, and their time and meaning relationships that could be eventually established
Resumo:
The Microarray technique is rather powerful, as it allows to test up thousands of genes at a time, but this produces an overwhelming set of data files containing huge amounts of data, which is quite difficult to pre-process, separate, classify and correlate for interesting conclusions to be extracted. Modern machine learning, data mining and clustering techniques based on information theory, are needed to read and interpret the information contents buried in those large data sets. Independent Component Analysis method can be used to correct the data affected by corruption processes or to filter the uncorrectable one and then clustering methods can group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two way unsupervised clustering for a corrected microarray data.
Resumo:
Sensor networks are increasingly becoming one of the main sources of Big Data on the Web. However, the observations that they produce are made available with heterogeneous schemas, vocabularies and data formats, making it difficult to share and reuse these data for other purposes than those for which they were originally set up. In this thesis we address these challenges, considering how we can transform streaming raw data to rich ontology-based information that is accessible through continuous queries for streaming data. Our main contribution is an ontology-based approach for providing data access and query capabilities to streaming data sources, allowing users to express their needs at a conceptual level, independent of implementation and language-specific details. We introduce novel query rewriting and data translation techniques that rely on mapping definitions relating streaming data models to ontological concepts. Specific contributions include: • The syntax and semantics of the SPARQLStream query language for ontologybased data access, and a query rewriting approach for transforming SPARQLStream queries into streaming algebra expressions. • The design of an ontology-based streaming data access engine that can internally reuse an existing data stream engine, complex event processor or sensor middleware, using R2RML mappings for defining relationships between streaming data models and ontology concepts. Concerning the sensor metadata of such streaming data sources, we have investigated how we can use raw measurements to characterize streaming data, producing enriched data descriptions in terms of ontological models. Our specific contributions are: • A representation of sensor data time series that captures gradient information that is useful to characterize types of sensor data. • A method for classifying sensor data time series and determining the type of data, using data mining techniques, and a method for extracting semantic sensor metadata features from the time series.
Resumo:
In the last decade, complex networks have widely been applied to the study of many natural and man-made systems, and to the extraction of meaningful information from the interaction structures created by genes and proteins. Nevertheless, less attention has been devoted to metabonomics, due to the lack of a natural network representation of spectral data. Here we define a technique for reconstructing networks from spectral data sets, where nodes represent spectral bins, and pairs of them are connected when their intensities follow a pattern associated with a disease. The structural analysis of the resulting network can then be used to feed standard data-mining algorithms, for instance for the classification of new (unlabeled) subjects. Furthermore, we show how the structure of the network is resilient to the presence of external additive noise, and how it can be used to extract relevant knowledge about the development of the disease.
Resumo:
Hoy en día, con la evolución continua y rápida de las tecnologías de la información y los dispositivos de computación, se recogen y almacenan continuamente grandes volúmenes de datos en distintos dominios y a través de diversas aplicaciones del mundo real. La extracción de conocimiento útil de una cantidad tan enorme de datos no se puede realizar habitualmente de forma manual, y requiere el uso de técnicas adecuadas de aprendizaje automático y de minería de datos. La clasificación es una de las técnicas más importantes que ha sido aplicada con éxito a varias áreas. En general, la clasificación se compone de dos pasos principales: en primer lugar, aprender un modelo de clasificación o clasificador a partir de un conjunto de datos de entrenamiento, y en segundo lugar, clasificar las nuevas instancias de datos utilizando el clasificador aprendido. La clasificación es supervisada cuando todas las etiquetas están presentes en los datos de entrenamiento (es decir, datos completamente etiquetados), semi-supervisada cuando sólo algunas etiquetas son conocidas (es decir, datos parcialmente etiquetados), y no supervisada cuando todas las etiquetas están ausentes en los datos de entrenamiento (es decir, datos no etiquetados). Además, aparte de esta taxonomía, el problema de clasificación se puede categorizar en unidimensional o multidimensional en función del número de variables clase, una o más, respectivamente; o también puede ser categorizado en estacionario o cambiante con el tiempo en función de las características de los datos y de la tasa de cambio subyacente. A lo largo de esta tesis, tratamos el problema de clasificación desde tres perspectivas diferentes, a saber, clasificación supervisada multidimensional estacionaria, clasificación semisupervisada unidimensional cambiante con el tiempo, y clasificación supervisada multidimensional cambiante con el tiempo. Para llevar a cabo esta tarea, hemos usado básicamente los clasificadores Bayesianos como modelos. La primera contribución, dirigiéndose al problema de clasificación supervisada multidimensional estacionaria, se compone de dos nuevos métodos de aprendizaje de clasificadores Bayesianos multidimensionales a partir de datos estacionarios. Los métodos se proponen desde dos puntos de vista diferentes. El primer método, denominado CB-MBC, se basa en una estrategia de envoltura de selección de variables que es voraz y hacia delante, mientras que el segundo, denominado MB-MBC, es una estrategia de filtrado de variables con una aproximación basada en restricciones y en el manto de Markov. Ambos métodos han sido aplicados a dos problemas reales importantes, a saber, la predicción de los inhibidores de la transcriptasa inversa y de la proteasa para el problema de infección por el virus de la inmunodeficiencia humana tipo 1 (HIV-1), y la predicción del European Quality of Life-5 Dimensions (EQ-5D) a partir de los cuestionarios de la enfermedad de Parkinson con 39 ítems (PDQ-39). El estudio experimental incluye comparaciones de CB-MBC y MB-MBC con los métodos del estado del arte de la clasificación multidimensional, así como con métodos comúnmente utilizados para resolver el problema de predicción de la enfermedad de Parkinson, a saber, la regresión logística multinomial, mínimos cuadrados ordinarios, y mínimas desviaciones absolutas censuradas. En ambas aplicaciones, los resultados han sido prometedores con respecto a la precisión de la clasificación, así como en relación al análisis de las estructuras gráficas que identifican interacciones conocidas y novedosas entre las variables. La segunda contribución, referida al problema de clasificación semi-supervisada unidimensional cambiante con el tiempo, consiste en un método nuevo (CPL-DS) para clasificar flujos de datos parcialmente etiquetados. Los flujos de datos difieren de los conjuntos de datos estacionarios en su proceso de generación muy rápido y en su aspecto de cambio de concepto. Es decir, los conceptos aprendidos y/o la distribución subyacente están probablemente cambiando y evolucionando en el tiempo, lo que hace que el modelo de clasificación actual sea obsoleto y deba ser actualizado. CPL-DS utiliza la divergencia de Kullback-Leibler y el método de bootstrapping para cuantificar y detectar tres tipos posibles de cambio: en las predictoras, en la a posteriori de la clase o en ambas. Después, si se detecta cualquier cambio, un nuevo modelo de clasificación se aprende usando el algoritmo EM; si no, el modelo de clasificación actual se mantiene sin modificaciones. CPL-DS es general, ya que puede ser aplicado a varios modelos de clasificación. Usando dos modelos diferentes, el clasificador naive Bayes y la regresión logística, CPL-DS se ha probado con flujos de datos sintéticos y también se ha aplicado al problema real de la detección de código malware, en el cual los nuevos ficheros recibidos deben ser continuamente clasificados en malware o goodware. Los resultados experimentales muestran que nuestro método es efectivo para la detección de diferentes tipos de cambio a partir de los flujos de datos parcialmente etiquetados y también tiene una buena precisión de la clasificación. Finalmente, la tercera contribución, sobre el problema de clasificación supervisada multidimensional cambiante con el tiempo, consiste en dos métodos adaptativos, a saber, Locally Adpative-MB-MBC (LA-MB-MBC) y Globally Adpative-MB-MBC (GA-MB-MBC). Ambos métodos monitorizan el cambio de concepto a lo largo del tiempo utilizando la log-verosimilitud media como métrica y el test de Page-Hinkley. Luego, si se detecta un cambio de concepto, LA-MB-MBC adapta el actual clasificador Bayesiano multidimensional localmente alrededor de cada nodo cambiado, mientras que GA-MB-MBC aprende un nuevo clasificador Bayesiano multidimensional. El estudio experimental realizado usando flujos de datos sintéticos multidimensionales indica los méritos de los métodos adaptativos propuestos. ABSTRACT Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important techniques that has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from an available training data, and secondly, classify the new incoming unseen data instances using the learned classifier. Classification is supervised when the whole class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when the whole class values are missing in the training data (i.e., unlabeled data). In addition, besides this taxonomy, the classification problem can be categorized into uni-dimensional or multi-dimensional depending on the number of class variables, one or more, respectively; or can be also categorized into stationary or streaming depending on the characteristics of the data and the rate of change underlying it. Through this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised unidimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we basically used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data. They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from 39-item Parkinson’s Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson’s disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both considered case studies, results are promising in terms of classification accuracy as well as regarding the analysis of the learned MBC graphical structures identifying known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from the stationary data sets by their highly rapid generation process and their concept-drifting aspect. That is, the learned concepts and/or the underlying distribution are likely changing and evolving over time, which makes the current classification model out-of-date requiring to be updated. CPL-DS uses the Kullback-Leibler divergence and bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged. CPL-DS is general as it can be applied to several classification models. Using two different models, namely, naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where the new received files should be continuously classified into malware or goodware. Experimental results show that our approach is effective for detecting different kinds of drift from partially labeled data streams, as well as having a good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor the concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. Experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.
Resumo:
Light Detection and Ranging (LIDAR) provides high horizontal and vertical resolution of spatial data located in point cloud images, and is increasingly being used in a number of applications and disciplines, which have concentrated on the exploit and manipulation of the data using mainly its three dimensional nature. Bathymetric LIDAR systems and data are mainly focused to map depths in shallow and clear waters with a high degree of accuracy. Additionally, the backscattering produced by the different materials distributed over the bottom surface causes that the returned intensity signal contains important information about the reflection properties of these materials. Processing conveniently these values using a Simplified Radiative Transfer Model, allows the identification of different sea bottom types. This paper presents an original method for the classification of sea bottom by means of information processing extracted from the images generated through LIDAR data. The results are validated using a vector database containing benthic information derived by marine surveys.
Resumo:
Este trabajo, «Una aproximación a Ia integración en Open Data de los recursos Inspire de Ia IDEE », tiene por objetivo el construir un puente entre las Infraestructuras de Datos Espaciales (IDE) y el mundo de los «datos abiertos » aprovechando el marco legal de la Reutilización de la Información del Sector Público (RISP). Tras analizar qué es RISP y en particular los datos abiertos, y cómo se implementa en distintas Administraciones, se estudian los requisitos técnicos y legales necesarios para construir el «traductor» que permita canalizar la información IDE en el portal central de reutilización de información español datos.gob.es, dando una mayor visibilidad a los recursos INSPIRE. El trabajo se centra específicamente en dos puntos: en primer lugar en proporcionar y documentar la solución técnica que sirva en primera instancia para que el Instituto Geográfico Nacional aporte con más eficiencia sus recursos a datos.gob.es. En segundo lugar, a estudiar la aplicabilidad de esta misma solución al ámbito de la IDE de España (IDEE), señalando problemas detectados en el análisis de su contenido y sugiriendo recomendaciones para minimizar los problemas de su potencial reutilización. ABSTRACT: This work titled «Analysis of the integration of INSPIRE resources coming from Spanish Spatial Data Infrastructure within the National Public Sector Information portal», aims to build a bridge between the Spatial Data Infrastructures (SDI ) and the world of "Open Data" taking advantage of the legal framework on the Re-use of Public Sector Information (PSI) . After analyzing what PSI reuse and Open Data is and how it is implemented by different administrations, a study to extract the technical and legal requirements is done to build the "translator" that will allow adding SDI resources within the Spanish portal for the PSI reuse data .gob.es while giving greater visibility to INSPIRE. This document specifically focuses on two aspects: first to provide and document the technical solution that serves primarily for the National Geographic Institute to supply more efficiently its resources to datos.gob.es. Secondly, to study the applicability of the proposed solution to the whole Spanish SDI (IDEE), noting identified problems and suggesting recommendations to minimize problems of its potential reuse.