901 results for Data dissemination and sharing
Abstract:
Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important of these techniques and has been successfully applied in several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from available training data, and second, classify new, unseen data instances using the learned classifier. Classification is supervised when all class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when all class values are missing from the training data (i.e., unlabeled data). Besides this taxonomy, the classification problem can be categorized as uni-dimensional or multi-dimensional depending on the number of class variables (one or more, respectively), and as stationary or streaming depending on the characteristics of the data and the rate of change underlying it. In this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised uni-dimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we mainly used Bayesian network classifiers as models.

The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers (MBCs) from stationary data. They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson's Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against methods commonly used for solving the Parkinson's disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both case studies, results are promising in terms of classification accuracy as well as regarding the analysis of the learned MBC graphical structures, which identify known and novel interactions among variables.

The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from stationary data sets in their very rapid generation process and their concept-drifting aspect. That is, the learned concepts and/or the underlying distribution are likely to change and evolve over time, which makes the current classification model out of date and in need of updating. CPL-DS uses the Kullback-Leibler divergence and the bootstrapping method to quantify and detect three possible kinds of drift: feature drift (in the predictors), conditional drift (in the class posterior), or dual drift (in both). Then, if any drift is detected, a new classification model is learned using the expectation-maximization (EM) algorithm; otherwise, the current classification model is kept unchanged. CPL-DS is general, as it can be applied to several classification models. Using two different models, namely, the naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where newly received files must be continuously classified as malware or goodware. Experimental results show that our approach is effective at detecting different kinds of drift in partially labeled data streams and also achieves good classification performance.

Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. The experimental study, carried out using synthetic multi-dimensional data streams, shows the merits of both proposed adaptive methods.
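The drift-monitoring step described for the third contribution can be illustrated with a minimal sketch of the Page-Hinkley test. This is a generic textbook form of the test, not the thesis implementation: the class name `PageHinkley`, the helper `first_alarm`, and the default values of `delta` and `threshold` are all assumptions chosen for illustration. The detector flags an upward shift in the mean of the monitored signal, so to watch for a drop in average log-likelihood one would feed it the negated score.

```python
from dataclasses import dataclass


@dataclass
class PageHinkley:
    """Generic Page-Hinkley detector for an upward shift in the mean."""
    delta: float = 0.005     # tolerated magnitude of change (illustrative default)
    threshold: float = 5.0   # alarm threshold, often written lambda (illustrative default)
    n: int = 0               # number of observations seen so far
    mean: float = 0.0        # running mean of the observations
    cum: float = 0.0         # cumulative deviation m_t
    cum_min: float = 0.0     # running minimum M_t of the cumulative deviation

    def update(self, x: float) -> bool:
        """Feed one observation; return True if a drift is flagged."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return self.cum - self.cum_min > self.threshold


def first_alarm(values, **kwargs):
    """Return the index of the first flagged observation, or None if there is none."""
    detector = PageHinkley(**kwargs)
    for i, value in enumerate(values):
        if detector.update(value):
            return i
    return None


# Hypothetical monitored signal: negated average log-likelihood per batch,
# stable for 50 batches and then degraded.
stream = [1.0] * 50 + [3.0] * 50
alarm_index = first_alarm(stream)
```

In a setting like the one described, an alarm would trigger either a local adaptation around the changed nodes (as in LA-MB-MBC) or relearning from scratch (as in GA-MB-MBC); the sketch covers only the detection step.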
Abstract:
Replication Data Management (RDM) aims at enabling the use of data collections from several iterations of an experiment. However, there are several major challenges to RDM from integrating data models and data from empirical study infrastructures that were not designed to cooperate, e.g., data model variation of local data sources. [Objective] In this paper we analyze RDM needs and evaluate conceptual RDM approaches to support replication researchers. [Method] We adapted the ATAM evaluation process to (a) analyze RDM use cases and needs of empirical replication study research groups and (b) compare three conceptual approaches to address these RDM needs: central data repositories with a fixed data model, heterogeneous local repositories, and an empirical ecosystem. [Results] While the central and local approaches have major issues that are hard to resolve in practice, the empirical ecosystem allows bridging current gaps in RDM from heterogeneous data sources. [Conclusions] The empirical ecosystem approach should be explored in diverse empirical environments.
Abstract:
Effective data summarization methods that use AI techniques can help humans understand large sets of data. In this paper, we describe a knowledge-based method for automatically generating summaries of geospatial and temporal data, i.e. data with geographical and temporal references. The method is useful for summarizing data streams, such as GPS traces and traffic information, that are becoming more prevalent with the increasing use of sensors in computing devices. The method presented here is an initial architecture for our ongoing research in this domain. We describe the data representations we have designed for our method and our implementations of the components that perform data abstraction and natural language generation. We also discuss evaluation results that show the ability of our method to generate certain types of geospatial and temporal descriptions.
Abstract:
The crop simulation model AquaCrop, recently developed by FAO, can be used for a wide range of purposes. However, in its present form, its use over large areas or for applications that require a large number of simulation runs (e.g., long-term analysis) is not practical without developing software to facilitate such applications. Two tools for managing the inputs and outputs of AquaCrop, named AquaData and AquaGIS, have been developed for this purpose and are presented here. Both software utilities have been programmed in Delphi v. 5; in addition, AquaGIS requires the Geographic Information System (GIS) programming tool MapObjects. These utilities allow the efficient management of input and output files, along with a GIS module for spatial analysis and spatial visualization of the results, facilitating knowledge dissemination. A sample application of the utilities is given here: an AquaCrop simulation analysis of the impact of climate change on wheat yield in southern Spain, which requires extensive input data preparation and output processing. The use of AquaCrop without the two utilities would have required approximately 1000 h of work, while the use of AquaData and AquaGIS reduced that time by more than 99%. Furthermore, the use of GIS made it possible to perform a spatial analysis of the results, thus providing a new option to extend the use of the AquaCrop model to scales requiring spatial and temporal analyses.
Abstract:
This monograph presents the background, context and technical details of an Application Schema for the inclusion of cultural heritage spatial data into the framework defined by the European INSPIRE directive on geographic information. Nowadays, INSPIRE provides the most relevant framework for the dissemination and exchange of geographical data, covering many different thematic fields and being particularly relevant for environmental datasets. Although cultural heritage elements are partially addressed within INSPIRE, there is no specific documentation on how these data should be considered, structured and published. This text aims to provide technical guidelines for decision makers, public administrations and the scientific community for the definition and implementation of harmonized datasets for cultural heritage, according to the interoperability principles of INSPIRE.
Abstract:
Service Business Frameworks consist of a number of interrelated components that support the management of business services across their whole lifecycle, from their creation, publication, discovery and comparison to their monetization (possibly including revenue settlement and sharing). In this regard, the FIWARE Business Framework aims to allow FIWARE users to enhance their solutions with search, discovery, comparison, monetization, and revenue settlement and sharing features. To achieve this objective, the FIWARE Business Framework provides the open specification and APIs of a comprehensive set of components (called Generic Enablers in FIWARE terminology), along with a reference implementation of these APIs, that can be easily integrated with existing systems in order to create value-added applications. At the beginning of this Master's Thesis, the FIWARE Business Framework was not mature enough to cover the requirements of its users, since it provided overly general models and left some key functionality to be implemented by those users. To deal with these issues, the main objective of this Master's Thesis has been to enhance and evolve the FIWARE Business Framework to meet the demands of its users. To achieve this objective, the FIWARE Business Framework has been evaluated using the feedback provided by FIWARE users, mainly SMEs and start-ups that actually use the framework in their solutions, in order to determine a list of requirements and to design a roadmap for the evolution and improvement of the existing framework over the next 6 months. Then, the different issues detected have been tackled one by one, in each case providing a solution able to cover users' requirements. Finally, the results of the project have been evaluated by integrating the evolved FIWARE Business Framework with an existing system in charge of the management of energy consumption data, building what has been called the Energy Consumption Data Market. This has also demonstrated the usefulness of the proposed business framework for evolving CKAN, a renowned open data platform, into an actual, fully-fledged data market.
Abstract:
With the continuous development in the fields of sensors, advanced data processing and communications, intelligent applications and services oriented to road transport have reached significant maturity and complexity. Cooperative ITS services, based on the idea of sharing accurate information among road entities, are currently being tested on a large scale by different initiatives. The field operational test (FOTsis) project contributes to the deployment environment with services that involve a significant number of entities outside the vehicle. This required specifying an architecture which, based on the ISO ITS station reference architecture for communications, could support the requirements of the services proposed in the project. During the project, internal implementation tests and external interoperability tests have validated the proposed architecture. At the same time, these tests have revealed areas in which the FOTsis architecture could be completed, mainly to take full advantage of all the emerging and foreseeable data sources that may be relevant in the road environment. In this study, the authors outline an approach that, based on the current cooperative ITS architecture and the Smart Cities and Internet of Things (IoT) architectures, can provide a common convergence platform to maximise the information available for ITS purposes.
Abstract:
Background: Semantic Web technologies have been widely applied in the life sciences, for example by data providers such as OpenLifeData and through web services frameworks such as SADI. The recently reported OpenLifeData2SADI project offers access to the vast OpenLifeData data store through SADI services. Findings: This article describes how to merge data retrieved from OpenLifeData2SADI with other SADI services using the Galaxy bioinformatics analysis platform, thus making this semantic data more amenable to complex analyses. This is demonstrated using a working example, which is made distributable and reproducible through a Docker image that includes SADI tools, along with the data and workflows that constitute the demonstration. Conclusions: The combination of Galaxy and Docker offers a solution for faithfully reproducing and sharing complex data retrieval and analysis workflows based on the SADI Semantic web service design patterns.
Abstract:
Funding The International Primary Care Respiratory Group (IPCRG) provided funding for this research project as an UNLOCK group study for which the funding was obtained through an unrestricted grant by Novartis AG, Basel, Switzerland. The latter funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. Database access for the OPCRD was provided by the Respiratory Effectiveness Group (REG) and Research in Real Life; the OPCRD statistical analysis was funded by REG. The Bocholtz Study was funded by PICASSO for COPD, an initiative of Boehringer Ingelheim, Pfizer and the Caphri Research Institute, Maastricht University, The Netherlands.
Abstract:
This work was financially supported by the German Federal Ministry of Food and Agriculture (BMEL) through the Federal Office for Agriculture and Food (BLE), (2851ERA01J). FT and RPR were supported by FACCE MACSUR (3200009600) through the Finnish Ministry of Agriculture and Forestry (MMM). EC, HE and EL were supported by The Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (220-2007-1218) and by the strategic funding ‘Soil-Water-Landscape’ from the faculty of Natural Resources and Agricultural Sciences (Swedish University of Agricultural Sciences) and thank professor P-E Jansson (Royal Institute of Technology, Stockholm) for support. JC, HR and DW thank the INRA ACCAF metaprogramm for funding and Eric Casellas from UR MIAT INRA for support. CB was funded by the Helmholtz project “REKLIM—Regional Climate Change”. CK was funded by the HGF Alliance “Remote Sensing and Earth System Dynamics” (EDA). FH was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under the Grant FOR1695. FE and SS acknowledge support by the German Science Foundation (project EW 119/5-1). HH, GZ, SS, TG and FE thank Andreas Enders and Gunther Krauss (INRES, University of Bonn) for support. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Abstract:
BodyMap is a human and mouse gene expression database that is based on site-directed 3′-expressed sequence tags generated at Osaka University. To date, it contains more than 300 000 tag sequences from 64 human and 39 mouse tissues. For the recent release, the precise anatomical expression patterns for more than half of the human gene entries were generated by introduced amplified fragment length polymorphism (iAFLP), which is a PCR-based high-throughput expression profiling method. The iAFLP data incorporated into BodyMap describe the relative contents of more than 12 000 transcripts across 30 tissue RNAs. In addition, a newly developed gene ranking system helps users obtain lists of genes that have desired expression patterns according to their significance. BodyMap supports complete transfer of unique data sets and provides analysis that is accessible through the WWW at http://bodymap.ims.u-tokyo.ac.jp.
Abstract:
Tide gauge (TG) data along the northern Mediterranean and Black Sea coasts are compared to the sea-surface height (SSH) anomaly obtained from ocean altimetry (TOPEX/Poseidon and ERS-1/2) for a period of nine years (1993–2001). The TG measures the SSH relative to the ground whereas the altimetry does so with respect to the geocentric reference frame; therefore their difference would in principle be the vertical ground motion of the TG sites, though there are different error sources for this estimate, as discussed in the paper. In this study we estimate such vertical ground motion, for each TG site, from the slope of the time series of the (non-seasonal) difference between the TG record and the altimetry measurement at the point closest to the TG. Where possible, these estimates are further compared with those derived from nearby continuous Global Positioning System (GPS) data series. These results on vertical ground motion along the Mediterranean and Black Sea coasts provide useful source data for studying, contrasting, and constraining tectonic models of the region. For example, on the eastern coast of the Adriatic Sea and on the western coast of Greece, a general subsidence is observed which may be related to the Adriatic lithosphere subducting beneath the Eurasian plate along the Dinarides fault.
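The slope-based estimate described above can be sketched as an ordinary least-squares trend of the difference series. The function names, the units (years and millimetres), and the sign convention (altimetry minus tide gauge, so that a negative rate means subsidence) are assumptions made for this illustration; the abstract does not state them. The sketch also assumes the seasonal signal has already been removed from both series and ignores the error sources the paper discusses.

```python
def linear_trend(t, y):
    """Ordinary least-squares slope of y against t."""
    n = len(t)
    t_mean = sum(t) / n
    y_mean = sum(y) / n
    sxx = sum((ti - t_mean) ** 2 for ti in t)
    sxy = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    return sxy / sxx


def vertical_ground_motion(t_years, tg_ssh_mm, alt_ssh_mm):
    """Estimate the vertical ground rate (mm/yr) at a tide gauge site.

    The tide gauge records sea level relative to the ground and the altimeter
    records it in a geocentric frame, so (altimetry - tide gauge) tracks the
    ground height. Assumed convention: positive rate = uplift.
    """
    diff = [a - g for a, g in zip(alt_ssh_mm, tg_ssh_mm)]
    return linear_trend(t_years, diff)


# Synthetic monthly series over nine years (hypothetical values, not from the paper):
# geocentric sea level rises at 3 mm/yr while the ground subsides at 1.5 mm/yr.
t = [i / 12 for i in range(108)]
sea = [3.0 * ti for ti in t]
ground = [-1.5 * ti for ti in t]
tide_gauge = [s - g for s, g in zip(sea, ground)]
rate = vertical_ground_motion(t, tide_gauge, sea)
```

With real records, the two series would first need to be matched in time and location, which the sketch leaves out.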
Open business intelligence: on the importance of data quality awareness in user-friendly data mining
Abstract:
Citizens demand more and more data for making decisions in their daily life. Therefore, mechanisms that allow citizens to understand and analyze linked open data (LOD) in a user-friendly manner are highly required. To this end, the concept of Open Business Intelligence (OpenBI) is introduced in this position paper. OpenBI enables non-expert users to (i) analyze and visualize LOD, thus generating actionable information by means of reporting, OLAP analysis, dashboards or data mining; and to (ii) share the newly acquired information as LOD to be reused by anyone. One of the most challenging issues of OpenBI is related to data mining, since non-experts (such as citizens) need guidance during preprocessing and application of mining algorithms due to the complexity of the mining process and the low quality of the data sources. This is even worse when dealing with LOD, not only because of the different kinds of links among data, but also because of its high dimensionality. As a consequence, in this position paper we advocate that data mining for OpenBI requires data quality-aware mechanisms for guiding non-expert users in obtaining and sharing the most reliable knowledge from the available LOD.
Abstract:
The environmental, cultural and socio-economic causes and consequences of farmland abandonment are issues of increasing concern for researchers and policy makers. In previous studies, we proposed a new methodology for selecting the driving factors in farmland abandonment processes. Using Data Mining and GIS, it is possible to select those variables which are more significantly related to abandonment. The aim of this study is to investigate the application of the above-mentioned methodology for finding relationships between relief and farmland abandonment in a Mediterranean region (SE Spain). We have taken into account up to 28 different variables in a single analysis, some of them commonly considered in land use change studies (slope, altitude, TWI, etc.), along with other novel variables (sky view factor, terrain view factor, etc.). The variable selection process provides results in line with the previous knowledge of the study area, describing some processes that are region specific (e.g. abandonment versus intensification of the agricultural activities). The European INSPIRE Directive (2007/2/EC) establishes that digital elevation models for land surfaces should be available in all member countries. This means that the research described in this work can be extrapolated to any European country to determine whether these variables (slope, altitude, etc.) are important in the process of abandonment.
Abstract:
Camera traps have become a widely used technique for conducting biological inventories, generating a large number of database records of great interest. The main aim of this paper is to describe a new free and open source software (FOSS) application, developed to facilitate the management of camera-trap data originating from a protected Mediterranean area (SE Spain). In the last decade, some other useful alternatives have been proposed, but ours focuses especially on collaborative work and on the importance of the spatial information underpinning common camera trap studies. This FOSS application, named "Camera Trap Manager" (CTM), has been designed on the .NET platform to expedite the processing of pictures. CTM has a very intuitive user interface, automatic extraction of some image metadata (date, time, moon phase, location, temperature, atmospheric pressure, among others), analytical capabilities (Geographical Information Systems, statistics, charts, among others), and reporting capabilities (ESRI Shapefiles, Microsoft Excel Spreadsheets, PDF reports, among others). Using this application, we have achieved very simple management, fast analysis, and a significant reduction of costs. While we were able to classify an average of 55 pictures per hour manually, CTM has made it possible to process over 1000 photographs per hour, consequently retrieving a greater amount of data.