934 results for Data anonymization and sanitization


Relevance: 100.00%

Abstract:

At present, little is known about the interannual, climate-related variability of zooplankton communities in the tropical Atlantic, central Mediterranean Sea, Caspian Sea, and Aral Sea, owing to the absence of appropriate databases. In the mid-latitudes, the North Atlantic Oscillation (NAO) is the dominant mode of atmospheric fluctuation over eastern North America, the northern Atlantic Ocean, and Europe. Therefore, one of the issues that need to be addressed through data synthesis is the evaluation of interannual patterns in species abundance and species diversity in these regions with regard to the NAO. The database has been used to investigate the ecological role of the NAO in interannual variations of mesozooplankton abundance and biomass along the zonal array of the NAO's influence. The basic approach of the proposed research involved: (1) developing co-operation between experts and data holders in Ukraine, Russia, Kazakhstan, Azerbaijan, the UK, and the USA to rescue and compile the oceanographic data sets and release them on CD-ROM; (2) organizing and compiling a database based on FSU cruises to the above regions; and (3) analyzing the basin-scale interannual variability of zooplankton species abundance, biomass, and species diversity.
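As an illustration of the kind of NAO analysis described, a minimal sketch in Python, using placeholder arrays rather than the rescued FSU data, might correlate an annual NAO index with interannual biomass values:

```python
# Minimal sketch (placeholder arrays, not the rescued FSU data): correlate an
# annual NAO index with interannual mesozooplankton biomass values.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
years = np.arange(1960, 1991)
nao_index = rng.normal(size=years.size)                      # hypothetical NAO index
biomass = 10 + 2 * nao_index + rng.normal(size=years.size)   # hypothetical biomass

r, p = pearsonr(nao_index, biomass)
print(f"NAO vs. mesozooplankton biomass: r = {r:.2f}, p = {p:.3g}")
```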

Relevance: 100.00%

Abstract:

During the past five million years, benthic δ18O records indicate a large range of climates, from warmer than today during the Pliocene Warm Period to considerably colder during glacials. Antarctic ice cores have revealed Pleistocene glacial-interglacial CO2 variability of 60-100 ppm, while sea-level fluctuations of typically 125 m are documented by proxy data. In the pre-ice-core period, however, CO2 and sea-level proxy data are scarce, and there is disagreement between different proxies and between different records of the same proxy. This hampers a comprehensive understanding of the long-term relations between CO2, sea level, and climate. Here, we drive a coupled climate-ice sheet model over the past five million years, inversely forced by a stacked benthic δ18O record. We obtain continuous simulations of benthic δ18O, sea level, and CO2 that are mutually consistent. Our model shows CO2 concentrations of 300 to 470 ppm during the Early Pliocene, and we simulate strong CO2 variability during the Pliocene and Early Pleistocene. These features are broadly supported by existing and new δ11B-based proxy CO2 data, but less so by alkenone-based records. The simulated concentrations, and the variations therein, are larger than expected from global mean temperature changes. Our findings thus suggest a smaller Earth System Sensitivity than previously thought, explained by a more restricted role of land-ice variability in the Pliocene. The largest uncertainty in our simulation arises from the mass-balance formulation of East Antarctica, which governs the variability in sea level but only modestly affects the modeled CO2 concentrations.
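The inverse-forcing idea can be sketched with a toy stand-in model; the mapping below is a hypothetical monotone relation, not the coupled climate-ice sheet model of the paper. At each time step, CO2 is nudged until the modelled benthic δ18O matches the observed value:

```python
# Toy sketch of inverse forcing; modelled_d18o is a hypothetical monotone
# stand-in, not the coupled climate-ice sheet model used in the paper.
import numpy as np

def modelled_d18o(co2_ppm):
    """Higher CO2 -> warmer climate -> less ice -> lower benthic d18O."""
    return 3.2 - 0.6 * np.log2(co2_ppm / 280.0)

def invert_co2(d18o_obs, co2_init=280.0, gain=50.0, iters=100):
    """For each observed d18O value, nudge CO2 until the model matches it."""
    co2, history = co2_init, []
    for target in d18o_obs:
        for _ in range(iters):                  # fixed-point iteration per step
            mismatch = modelled_d18o(co2) - target
            co2 = float(np.clip(co2 + gain * mismatch, 150.0, 600.0))
        history.append(co2)
    return np.array(history)

# Lower (warmer) d18O values are inverted into higher CO2 concentrations.
print(invert_co2(np.array([3.2, 2.9, 3.5])).round(0))   # -> [280. 396. 198.]
```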

Relevance: 100.00%

Abstract:

Over the last decade, several hundred seals have been equipped with conductivity-temperature-depth sensors in the Southern Ocean for both biological and physical oceanographic studies. A calibrated collection of seal-derived hydrographic data is now available, consisting of more than 165,000 profiles. The value of these hydrographic data within the existing Southern Ocean observing system is demonstrated here by conducting two state estimation experiments that differ only in whether seal data are used to constrain the system. Including seal-derived data substantially modifies the estimated surface mixed-layer properties and circulation patterns within and south of the Antarctic Circumpolar Current. Agreement with independent satellite observations of sea-ice concentration is improved, especially along the East Antarctic shelf. Instrumented animals efficiently narrow a critical observational gap, and their contribution to monitoring polar climate variability will continue to grow as data accuracy and spatial coverage increase.

Relevance: 100.00%

Abstract:

This paper describes seagrass species and percentage-cover point-based field data sets derived from georeferenced photo transects. Annually or biannually over a ten-year period (2004-2015), data sets were collected along 30-50 transects, 500-800 m in length, distributed across a 142 km² shallow, clear-water seagrass habitat: the Eastern Banks, Moreton Bay, Australia. Each of the eight data sets includes seagrass property information derived from approximately 3,000 georeferenced, downward-looking photographs captured at 2-4 m intervals along the transects. Photographs were manually interpreted to estimate seagrass species composition and percentage cover using Coral Point Count with Excel extensions (CPCe). Understanding seagrass biology, ecology, and dynamics for scientific and management purposes requires point-based data on species composition and cover. This data set, and the methods used to derive it, are a globally unique example for seagrass ecological applications. It provides the basis for multiple further studies at this site, for regional to global comparative studies, and for the design of similar monitoring programs elsewhere.
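The CPCe-style point-count interpretation reduces to a simple calculation: place a set of points on each photo, label the substrate under each point, and report per-label percentages. A minimal sketch, with made-up labels:

```python
# Sketch of CPCe-style point counting: each photo receives a set of points, an
# analyst labels the substrate under each point, and percentage cover is the
# share of points per label. Labels below are made up for illustration.
from collections import Counter

def percent_cover(point_labels):
    """point_labels: one label per point, e.g. 'Zostera' or 'sand'."""
    counts = Counter(point_labels)
    n = len(point_labels)
    return {label: 100.0 * c / n for label, c in counts.items()}

labels = ["Zostera"] * 6 + ["Halophila"] * 2 + ["sand"] * 16   # 24 points/photo
print(percent_cover(labels))   # {'Zostera': 25.0, 'Halophila': ~8.3, 'sand': ~66.7}
```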

Relevance: 100.00%

Abstract:

Measures have been developed to understand tendencies in the distribution of economic activity, and their merits lie in the convenience of data collection and processing. In this interim report, investigating the ability of such measures to capture the geographical spread of economic activities, we summarize their merits and limitations and make clear that caution is required in their use. As a first trial in handling areal data, this project focuses on administrative areas, not on point data or input-output data; firm-level data are not within the scope of this article. The rest of this article is organized as follows. In Section 2, we touch on the limitations of, and problems associated with, the measures and areal data. Specific measures are introduced in Section 3 and applied in Section 4. The conclusion summarizes the findings and discusses future work.
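As a hedged illustration of the kind of measure discussed (the report does not name its specific measures here), one commonly used concentration index over administrative areas is the Herfindahl index of regional shares:

```python
# One commonly used concentration measure over administrative areas: the
# Herfindahl index of regional employment shares (figures are illustrative).
def herfindahl(employment_by_area):
    total = sum(employment_by_area)
    shares = [e / total for e in employment_by_area]
    return sum(s * s for s in shares)   # ranges from 1/n (even spread) to 1

print(herfindahl([100, 100, 100, 100]))   # 0.25: evenly spread over 4 areas
print(herfindahl([370, 10, 10, 10]))      # ~0.86: highly concentrated
```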

Relevance: 100.00%

Abstract:

To research and document relevant sloshing-type phenomena, a series of experiments has been conducted; the aim of this paper is to describe the setup and data processing of these experiments. A sloshing tank is subjected to angular motion, and pressure records are obtained at several locations, together with the motion data, torque, and a collection of image and video information. The experimental rig and the data acquisition systems are described, and useful information for experimental sloshing research practitioners is provided, covering the liquids used in the experiments, dyeing techniques, tank-building processes, the synchronization of the acquisition systems, and related matters. A new procedure for reconstructing experimental data that takes experimental uncertainties into account is presented, based on a least-squares spline approximation of the data. Finally, based on a deterministic approach to the first sloshing-wave impact event in a sloshing experiment, an uncertainty analysis procedure for the associated first pressure-peak value is described.
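A least-squares spline reconstruction of a noisy pressure record can be sketched as follows; the synthetic signal and the knot spacing are assumptions of this sketch, not the paper's actual choices:

```python
# Hedged sketch of a least-squares spline reconstruction of a noisy pressure
# record; the synthetic signal and knot spacing are illustrative assumptions.
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

t = np.linspace(0.0, 2.0, 400)                               # time [s]
rng = np.random.default_rng(0)
p = np.exp(-3 * t) * np.sin(20 * t) + 0.02 * rng.normal(size=t.size)

interior_knots = np.linspace(0.1, 1.9, 25)   # must lie strictly inside (t0, tN)
spline = LSQUnivariateSpline(t, p, interior_knots, k=3)      # least-squares fit

p_smooth = spline(t)                                         # reconstructed record
i = int(np.argmax(p_smooth))
print(f"first pressure peak: p = {p_smooth[i]:.3f} at t = {t[i]:.3f} s")
```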

Relevance: 100.00%

Abstract:

The development of new-generation intelligent vehicle technologies will lead to improved road safety and reduced CO2 emissions. However, the weak point of all these systems is their need for comprehensive and reliable data. Two sources of traffic data are currently available: 1) infrastructure sensors and 2) floating vehicles. The former consists of a set of fixed-point detectors installed in the roads; the latter uses mobile probe vehicles as mobile sensors. Both systems still have deficiencies, however. Infrastructure sensors retrieve information from static points on the road that are spaced, in some cases, kilometers apart, so the picture they give of the actual traffic situation is incomplete. This deficiency is corrected by floating cars, which retrieve dynamic information on the traffic situation; unfortunately, the number of floating-data vehicles currently available is too small to give a complete picture of road traffic. In this paper, we present a floating car data (FCD) augmentation system that combines information from floating-data vehicles and infrastructure sensors and that, using neural networks, is capable of augmenting the amount of FCD with virtual information. This system has been implemented and tested on actual roads, and the results show little difference between the data supplied by the floating vehicles and by the virtual vehicles.
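A minimal sketch of the underlying idea, with entirely hypothetical features and targets rather than the paper's road data, trains a small neural network to map infrastructure-sensor readings to "virtual" floating-car speeds:

```python
# Illustrative sketch (hypothetical features and targets): train a small neural
# network to map infrastructure-sensor readings to "virtual" floating-car speeds.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 3))        # e.g. loop occupancy, flow, time of day
y = 120 * (1 - X[:, 0]) + 5 * rng.normal(size=2000)   # synthetic segment speed

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
net.fit(X_tr, y_tr)
print("R^2 on held-out data:", round(net.score(X_te, y_te), 3))
```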

Relevance: 100.00%

Abstract:

Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually and requires adequate machine learning and data mining techniques. Classification is one of the most important of these techniques and has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from available training data; second, classify new incoming unseen data instances using the learned classifier. Classification is supervised when all class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when all class values are missing from the training data (i.e., unlabeled data). Besides this taxonomy, the classification problem can be categorized into uni-dimensional or multi-dimensional depending on the number of class variables (one or more, respectively), and into stationary or streaming depending on the characteristics of the data and the rate of change underlying them. Throughout this thesis, we deal with the classification problem under three different settings, namely supervised multi-dimensional stationary classification, semi-supervised uni-dimensional streaming classification, and supervised multi-dimensional streaming classification, basically using Bayesian network classifiers as models.

The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data, proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second, named MB-MBC, is a filter constraint-based approach built on Markov blankets. Both methods are applied to two important real-world problems: the prediction of human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from the 39-item Parkinson's Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against methods commonly used for the Parkinson's disease prediction problem, namely multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both case studies, results are promising in terms of classification accuracy as well as in the analysis of the learned MBC graphical structures, which identify both known and novel interactions among variables.

The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from stationary data sets in their highly rapid generation process and their concept-drifting aspect: the learned concepts and/or the underlying distribution are likely to change and evolve over time, making the current classification model out-of-date and requiring it to be updated. CPL-DS uses the Kullback-Leibler divergence and bootstrapping to quantify and detect three possible kinds of drift: feature, conditional, or dual. If any drift is detected, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged. CPL-DS is general in that it can be applied to several classification models. Using two different models, the naive Bayes classifier and logistic regression, CPL-DS is tested on synthetic data streams and applied to the real-world problem of malware detection, in which newly received files must be continuously classified as malware or goodware. Experimental results show that our approach is effective at detecting different kinds of drift from partially labeled data streams while maintaining good classification performance.

Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor concept drift over time using the average log-likelihood score and the Page-Hinkley test. If a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. An experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.
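The drift-detection step of CPL-DS can be sketched as follows; this is a simplified single-feature illustration, and the histogram bins, window sizes, and permutation bootstrap are assumptions of the sketch, not the thesis's exact procedure:

```python
# Simplified single-feature illustration of the drift test: compare the feature
# distributions of two windows with the Kullback-Leibler divergence and
# calibrate the alarm threshold with a permutation bootstrap.
import numpy as np
from scipy.stats import entropy

def kl_hist(a, b, bins=20):
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, _ = np.histogram(a, bins=bins, range=(lo, hi))
    q, _ = np.histogram(b, bins=bins, range=(lo, hi))
    p = (p + 1) / (p + 1).sum()           # Laplace smoothing avoids log(0)
    q = (q + 1) / (q + 1).sum()
    return entropy(p, q)                  # KL(p || q)

def drift_detected(old, new, n_boot=500, alpha=0.05, seed=0):
    observed = kl_hist(old, new)
    pooled = np.concatenate([old, new])
    rng = np.random.default_rng(seed)
    null = []
    for _ in range(n_boot):               # resample under the "no drift" hypothesis
        s = rng.permutation(pooled)
        null.append(kl_hist(s[:old.size], s[old.size:]))
    return observed > np.quantile(null, 1 - alpha)

rng = np.random.default_rng(1)
print(drift_detected(rng.normal(0, 1, 500), rng.normal(0, 1, 500)))  # False
print(drift_detected(rng.normal(0, 1, 500), rng.normal(1, 1, 500)))  # True
```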

Relevance: 100.00%

Abstract:

Replication Data Management (RDM) aims to enable the use of data collections from several iterations of an experiment. However, RDM faces several major challenges in integrating data models and data from empirical study infrastructures that were not designed to cooperate, e.g., data-model variation across local data sources. [Objective] In this paper, we analyze RDM needs and evaluate conceptual RDM approaches to support replication researchers. [Method] We adapted the ATAM evaluation process to (a) analyze the RDM use cases and needs of empirical replication study research groups and (b) compare three conceptual approaches to addressing these RDM needs: central data repositories with a fixed data model, heterogeneous local repositories, and an empirical ecosystem. [Results] While the central and local approaches have major issues that are hard to resolve in practice, the empirical ecosystem allows bridging current gaps in RDM from heterogeneous data sources. [Conclusions] The empirical ecosystem approach should be explored in diverse empirical environments.

Relevance: 100.00%

Abstract:

Effective data summarization methods that use AI techniques can help humans understand large sets of data. In this paper, we describe a knowledge-based method for automatically generating summaries of geospatial and temporal data, i.e., data with geographical and temporal references. The method is useful for summarizing data streams, such as GPS traces and traffic information, which are becoming more prevalent with the increasing use of sensors in computing devices. The method presented here is an initial architecture for our ongoing research in this domain. We describe the data representations we have designed for our method and our implementations of the components that perform data abstraction and natural language generation. We also discuss evaluation results that show the ability of our method to generate certain types of geospatial and temporal descriptions.
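A toy sketch of the two stages described (data abstraction, then template-based realization) on a hypothetical GPS trace; every name and value below is illustrative, not part of the system described in the paper:

```python
# Toy sketch: abstract a GPS trace into a few summary facts, then realize them
# with a fixed sentence template. All data and names are illustrative.
from datetime import datetime

trace = [                                  # (timestamp, lat, lon)
    (datetime(2011, 5, 2, 8, 0), 40.42, -3.70),
    (datetime(2011, 5, 2, 8, 30), 40.45, -3.69),
    (datetime(2011, 5, 2, 9, 15), 40.48, -3.68),
]

def abstract_trace(trace):
    """Data abstraction: reduce raw fixes to a handful of summary facts."""
    start, end = trace[0], trace[-1]
    hours = (end[0] - start[0]).total_seconds() / 3600
    return {"start": start[0], "duration_h": hours, "n_fixes": len(trace)}

def realize(facts):
    """Natural language generation: fill a fixed sentence template."""
    return (f"The trip began at {facts['start']:%H:%M}, lasted "
            f"{facts['duration_h']:.1f} hours and produced {facts['n_fixes']} GPS fixes.")

print(realize(abstract_trace(trace)))
```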

Relevance: 100.00%

Abstract:

Funding: The International Primary Care Respiratory Group (IPCRG) provided funding for this research project as an UNLOCK group study, with funding obtained through an unrestricted grant from Novartis AG, Basel, Switzerland. The latter funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript. Database access for the OPCRD was provided by the Respiratory Effectiveness Group (REG) and Research in Real Life; the OPCRD statistical analysis was funded by REG. The Bocholtz Study was funded by PICASSO for COPD, an initiative of Boehringer Ingelheim, Pfizer, and the Caphri Research Institute, Maastricht University, The Netherlands.

Relevance: 100.00%

Abstract:

This work was financially supported by the German Federal Ministry of Food and Agriculture (BMEL) through the Federal Office for Agriculture and Food (BLE) (2851ERA01J). FT and RPR were supported by FACCE MACSUR (3200009600) through the Finnish Ministry of Agriculture and Forestry (MMM). EC, HE, and EL were supported by the Swedish Research Council for Environment, Agricultural Sciences and Spatial Planning (220-2007-1218) and by the strategic funding 'Soil-Water-Landscape' from the Faculty of Natural Resources and Agricultural Sciences (Swedish University of Agricultural Sciences), and thank Professor P-E Jansson (Royal Institute of Technology, Stockholm) for support. JC, HR, and DW thank the INRA ACCAF metaprogramme for funding and Eric Casellas from UR MIAT INRA for support. CB was funded by the Helmholtz project “REKLIM—Regional Climate Change”. CK was funded by the HGF Alliance “Remote Sensing and Earth System Dynamics” (EDA). FH was funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG) under Grant FOR1695. FE and SS acknowledge support from the German Science Foundation (project EW 119/5-1). HH, GZ, SS, TG, and FE thank Andreas Enders and Gunther Krauss (INRES, University of Bonn) for support. The funders had no role in study design, data collection and analysis, the decision to publish, or preparation of the manuscript.

Relevance: 100.00%

Abstract:

BodyMap is a human and mouse gene expression database based on site-directed 3′-expressed sequence tags generated at Osaka University. To date, it contains more than 300,000 tag sequences from 64 human and 39 mouse tissues. For the most recent release, precise anatomical expression patterns for more than half of the human gene entries were generated by introduced amplified fragment length polymorphism (iAFLP), a PCR-based high-throughput expression profiling method. The iAFLP data incorporated into BodyMap describe the relative abundance of more than 12,000 transcripts across 30 tissue RNAs. In addition, a newly developed gene-ranking system helps users obtain lists of genes with desired expression patterns, ordered by significance. BodyMap supports the complete transfer of unique data sets and provides analyses accessible through the WWW at http://bodymap.ims.u-tokyo.ac.jp.