978 results for Core data set
Abstract:
In this work we discuss a project started by the Emilia-Romagna Regional Government concerning the management of public transport. In particular, we perform a data mining analysis on the data set of this project. After introducing the Weka software used for our analysis, we explore the most useful data mining techniques and algorithms, and we show how these results can be used to violate the privacy of the public transport operators themselves. Finally, although it is beyond the scope of this work, we also say a few words about how this kind of attack can be prevented.
Abstract:
Data deduplication describes a class of approaches that reduce the storage capacity needed to store data or the amount of data that has to be transferred over a network. These approaches detect coarse-grained redundancies within a data set, e.g. a file system, and remove them. One of the most important applications of data deduplication is backup storage systems, where these approaches are able to reduce the storage requirements to a small fraction of the logical backup data size. This thesis introduces multiple new extensions of so-called fingerprinting-based data deduplication. It starts with the presentation of a novel system design, which allows using a cluster of servers to perform exact data deduplication with small chunks in a scalable way. Afterwards, a combination of compression approaches for an important, but often overlooked, data structure in data deduplication systems, so-called block and file recipes, is introduced. Using these compression approaches, which exploit unique properties of data deduplication systems, the size of these recipes can be reduced by more than 92% in all investigated data sets. As file recipes can occupy a significant fraction of the overall storage capacity of data deduplication systems, the compression enables significant savings. A technique to increase the write throughput of data deduplication systems, based on the aforementioned block and file recipes, is introduced next. The novel Block Locality Caching (BLC) uses properties of block and file recipes to overcome the chunk lookup disk bottleneck of data deduplication systems, which limits either the scalability or the throughput of such systems. The presented BLC overcomes the disk bottleneck more efficiently than existing approaches, and it is shown to be less prone to aging effects. Finally, it is investigated whether large HPC storage systems contain redundancies that can be found by fingerprinting-based data deduplication. Over 3 PB of HPC storage data from different data sets have been analyzed. In most data sets, between 20 and 30% of the data can be classified as redundant. According to these results, future work should further investigate how data deduplication can be integrated into future HPC storage systems. This thesis presents important novel work in different areas of data deduplication research.
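To make the fingerprinting idea concrete, here is a minimal, hypothetical sketch of chunk-based deduplication with file recipes in Python. It uses fixed-size chunks and SHA-256 fingerprints purely for illustration; the thesis's systems (e.g. the clustered design and Block Locality Caching) are far more elaborate and are not reproduced here.

```python
import hashlib
from collections import OrderedDict

CHUNK_SIZE = 8 * 1024  # fixed-size chunks; the chunk size here is illustrative only

def chunk_fingerprints(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks and fingerprint each chunk."""
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        yield hashlib.sha256(chunk).hexdigest(), chunk

class DedupStore:
    """Toy deduplicating store: chunks are stored once, files become recipes of fingerprints."""
    def __init__(self):
        self.chunks = {}              # fingerprint -> chunk bytes (the chunk store)
        self.recipes = OrderedDict()  # file name -> list of fingerprints (the file recipe)

    def write(self, name: str, data: bytes):
        recipe = []
        for fp, chunk in chunk_fingerprints(data):
            if fp not in self.chunks:      # chunk lookup: only previously unseen chunks are stored
                self.chunks[fp] = chunk
            recipe.append(fp)
        self.recipes[name] = recipe

    def read(self, name: str) -> bytes:
        return b"".join(self.chunks[fp] for fp in self.recipes[name])

store = DedupStore()
store.write("backup-1", b"A" * 32768)
store.write("backup-2", b"A" * 32768)   # a fully redundant second backup
assert store.read("backup-2") == b"A" * 32768
print(len(store.chunks), "unique chunk(s) stored for two identical 32 KiB backups")
```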
Abstract:
When estimating the effect of treatment on HIV using data from observational studies, standard methods may produce biased estimates due to the presence of time-dependent confounders. Such confounding can be present when a covariate, affected by past exposure, is a predictor of both the future exposure and the outcome. One example is the CD4 cell count, which is a marker of disease progression for HIV patients, but also a marker for treatment initiation, and is itself influenced by treatment. Fitting a marginal structural model (MSM) using inverse probability weights is one way to adjust appropriately for this type of confounding. In this paper we study a simple and intuitive approach to estimate similar treatment effects, using observational data to mimic several randomized controlled trials. Each 'trial' is constructed based on individuals starting treatment in a certain time interval. An overall effect estimate for all such trials is found using composite likelihood inference. The method offers an alternative to the use of inverse probability of treatment weights, which is unstable in certain situations. The estimated parameter is not identical to that of an MSM; it is conditioned on covariate values at the start of each mimicked trial. This allows the study of questions that are not as easily addressed by fitting an MSM. The analysis can be performed as a stratified weighted Cox analysis on the joint data set of all the constructed trials, where each trial is one stratum. The model is applied to data from the Swiss HIV cohort study.
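As a rough illustration of the mimicked-trials idea and the stratified Cox analysis described above, the sketch below builds one 'trial' per treatment-start interval from simulated subject-level data and pools them with the trial index as stratum. The column names, eligibility rule, and simulated data are assumptions, and the composite likelihood weighting of the paper is omitted; it assumes the lifelines package is available.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(1)

# Hypothetical subject-level data: time of treatment start (np.inf = never treated),
# a baseline covariate (CD4), follow-up time, and an event indicator.
n = 300
df = pd.DataFrame({
    "start": rng.choice([0.0, 3.0, 6.0, 9.0, np.inf], size=n, p=[0.2, 0.2, 0.2, 0.2, 0.2]),
    "cd4": rng.normal(350, 100, size=n),
    "event_time": rng.exponential(24, size=n),
    "event": rng.integers(0, 2, size=n),
})

def build_mimicked_trials(data, trial_starts):
    """For each interval start t, keep subjects still untreated and event-free at t.
    Those initiating treatment at t form the treated arm; the others the control arm.
    Follow-up and covariates are taken from the start of each mimicked trial."""
    trials = []
    for k, t in enumerate(trial_starts):
        eligible = data[(data["start"] >= t) & (data["event_time"] > t)].copy()
        eligible["trial"] = k
        eligible["arm"] = (eligible["start"] == t).astype(int)
        eligible["followup"] = eligible["event_time"] - t   # time measured from trial start
        trials.append(eligible[["trial", "arm", "cd4", "followup", "event"]])
    return pd.concat(trials, ignore_index=True)

pooled = build_mimicked_trials(df, trial_starts=[0.0, 3.0, 6.0, 9.0])

# Stratified Cox analysis on the joint data set of all constructed trials,
# with each trial as one stratum (subjects may appear in several trials;
# the composite-likelihood adjustment of the paper is not shown here).
cph = CoxPHFitter()
cph.fit(pooled, duration_col="followup", event_col="event", strata=["trial"])
cph.print_summary()
```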
Abstract:
Despite association with lung growth and long-term respiratory morbidity, there is a lack of normative lung function data for unsedated infants conforming to latest European Respiratory Society/American Thoracic Society standards. Lung function was measured using an ultrasonic flow meter in 342 unsedated, healthy, term-born infants at a mean ± sd age of 5.1 ± 0.8 weeks during natural sleep according to the latest standards. Tidal breathing flow-volume loops (TBFVL) and exhaled nitric oxide (eNO) measurements were obtained from 100 regular breaths. We aimed for three acceptable measurements for multiple-breath washout and 5-10 acceptable interruption resistance (R(int)) measurements. Acceptable measurements were obtained in ≤ 285 infants with high variability. Mean values were 7.48 mL·kg⁻¹ (95% limits of agreement 4.95-10.0 mL·kg⁻¹) for tidal volume, 14.3 ppb (2.6-26.1 ppb) for eNO, 23.9 mL·kg⁻¹ (16.0-31.8 mL·kg⁻¹) for functional residual capacity, 6.75 (5.63-7.87) for lung clearance index and 3.78 kPa·s·L⁻¹ (1.14-6.42 kPa·s·L⁻¹) for R(int). In males, TBFVL outcomes were associated with anthropometric parameters and in females, with maternal smoking during pregnancy, maternal asthma and Caesarean section. This large normative data set in unsedated infants offers reference values for future research and particularly for studies where sedation may put infants at risk. Furthermore, it highlights the impact of maternal and environmental risk factors on neonatal lung function.
Abstract:
For the first time we present a multi-proxy data set for the Russian Altai, consisting of Siberian larch tree-ring width (TRW), latewood density (MXD), δ13C and δ18O in cellulose chronologies obtained for the period 1779–2007 and cell wall thickness (CWT) for 1900–2008. All of these parameters agree well with each other in their high-frequency variability, while the low-frequency climate information shows systematic differences. Correlation analysis with temperature and precipitation data from the closest weather station and gridded data revealed that annual TRW, MXD, CWT, and δ13C data contain a strong summer temperature signal, while δ18O in cellulose represents a mixed summer and winter temperature and precipitation signal. The temperature and precipitation reconstructions from the Belukha ice core and Teletskoe lake sediments were used to investigate the correspondence of different independent proxies. Low-frequency patterns in the TRW and δ13C chronologies are consistent with temperature reconstructions from the nearby Belukha ice core and Teletskoe lake sediments, showing a pronounced warming trend in the last century. Their combination could be used for a regional temperature reconstruction. The long-term δ18O trend agrees with the precipitation reconstruction from the Teletskoe lake sediment, indicating more humid conditions during the twentieth century. Therefore, these two proxies could be combined for a precipitation reconstruction.
Abstract:
This paper proposes Poisson log-linear multilevel models to investigate population variability in sleep state transition rates. We specifically propose a Bayesian Poisson regression model that is more flexible, more scalable to larger studies, and more easily fitted than other attempts in the literature. We further use hierarchical random effects to account for pairings of individuals and repeated measures within those individuals, as comparing diseased to non-diseased subjects while minimizing bias is of epidemiologic importance. We estimate essentially non-parametric piecewise constant hazards and smooth them, and allow for time-varying covariates and segment-of-the-night comparisons. The Bayesian Poisson regression is justified through a re-derivation of a classical algebraic likelihood equivalence between Poisson regression with a log(time) offset and survival regression assuming piecewise constant hazards. This relationship allows us to synthesize two methods currently used to analyze sleep transition phenomena: stratified multi-state proportional hazards models and log-linear models with GEE for transition counts. An example data set from the Sleep Heart Health Study is analyzed.
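The log(time)-offset equivalence mentioned above can be illustrated with a short, self-contained sketch: a Poisson GLM for transition counts with log exposure time as offset plays the role of a piecewise-constant-hazard survival regression. The data, column names, and effect sizes below are simulated assumptions, and the hierarchical random effects of the proposed Bayesian model are not included; the sketch assumes statsmodels.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# One row per subject x night segment: number of sleep-state transitions observed in
# the segment, total time at risk in the segment, and a disease-group indicator.
n = 500
d = pd.DataFrame({
    "segment": rng.integers(0, 4, size=n),          # segment of the night (piecewise hazard)
    "diseased": rng.integers(0, 2, size=n),         # e.g. diseased vs. non-diseased subjects
    "time_at_risk": rng.uniform(10, 120, size=n),   # minutes at risk in the segment
})
base_rate = 0.02 * np.exp(0.4 * d["diseased"] + 0.1 * d["segment"])
d["transitions"] = rng.poisson(base_rate * d["time_at_risk"])

# Design matrix: segment indicators (piecewise constant hazard) plus the group covariate.
X = sm.add_constant(
    pd.get_dummies(d["segment"], prefix="seg", drop_first=True).astype(float)
      .join(d["diseased"].astype(float))
)
model = sm.GLM(d["transitions"], X,
               family=sm.families.Poisson(),
               offset=np.log(d["time_at_risk"]))    # the log(time) offset
fit = model.fit()
print(fit.summary())   # exp(coef) for 'diseased' estimates the transition-rate ratio
```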
Abstract:
PURPOSE: To describe the implementation and use of an electronic patient-referral system as an aid to the efficient referral of patients to a remote and specialized treatment center. METHODS AND MATERIALS: A system for the exchange of radiotherapy data between different commercial planning systems and a specially developed planning system for proton therapy has been developed through the use of the PAPYRUS diagnostic image standard as an intermediate format. To ensure the cooperation of the different TPS manufacturers, the number of data sets defined for transfer has been restricted to the three core data sets of CT, VOIs, and three-dimensional dose distributions. As a complement to the exchange of data, network-wide application-sharing (video-conferencing) technologies have been adopted to provide methods for the interactive discussion and assessment of treatment plans with one or more partner clinics. RESULTS: Through the use of evaluation plans based on the exchanged data, referring clinics can accurately assess the advantages offered by proton therapy on a patient-by-patient basis, while the practicality or otherwise of the proposed treatments can simultaneously be assessed by the proton therapy center. Such a system, along with the interactive capabilities provided by video-conferencing methods, has been found to be an efficient solution to the problem of patient assessment and selection at a specialized treatment center, and is a necessary first step toward the full electronic integration of such centers with their remotely situated referral centers.
DIMENSION REDUCTION FOR POWER SYSTEM MODELING USING PCA METHODS CONSIDERING INCOMPLETE DATA READINGS
Abstract:
Principal Component Analysis (PCA) is a popular method for dimension reduction that can be used in many fields including data compression, image processing, and exploratory data analysis. However, the traditional PCA method has several drawbacks: it is not efficient for dealing with high-dimensional data and cannot compute sufficiently accurate principal components when a relatively large portion of the data is missing. In this report, we propose to use the EM-PCA method for dimension reduction of power system measurements with missing data, and provide a comparative study of the traditional PCA and EM-PCA methods. Our extensive experimental results show that the EM-PCA method is more effective and more accurate for dimension reduction of power system measurement data than the traditional PCA method when dealing with data sets that have a large portion of missing values.
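Below is a minimal numpy sketch of the EM-PCA idea, assuming the usual alternation between imputing missing entries from the current low-rank reconstruction and refitting the principal subspace by SVD. The data, rank, and stopping rule are illustrative assumptions and this is not the report's implementation.

```python
import numpy as np

def em_pca(X, k, n_iter=100, tol=1e-6):
    """PCA with missing data: X is (samples x features) with NaN for missing readings."""
    missing = np.isnan(X)
    X_filled = np.where(missing, np.nanmean(X, axis=0), X)   # initialize with column means
    prev_err = np.inf
    for _ in range(n_iter):
        mu = X_filled.mean(axis=0)
        Xc = X_filled - mu
        # E-step uses the current completed matrix; M-step refits the top-k subspace via SVD.
        U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
        recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        X_filled[missing] = recon[missing]          # re-impute only the missing cells
        err = np.sqrt(np.mean((X_filled - recon) ** 2))   # residual on observed cells
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    components = Vt[:k]                             # principal directions
    scores = (X_filled - mu) @ components.T
    return components, scores, X_filled

# Toy power-system-like measurement matrix with 20% of readings missing.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))
X[rng.random(X.shape) < 0.2] = np.nan
comps, scores, X_hat = em_pca(X, k=3)
print(comps.shape, scores.shape)
```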
Abstract:
Here we present a study of the 11 yr sunspot cycle's imprint on the Northern Hemisphere atmospheric circulation, using three recently developed gridded upper-air data sets that extend back to the early twentieth century. We find a robust response of the tropospheric late-wintertime circulation to the sunspot cycle, independent of the data set. This response is particularly significant over Europe, although results show that it is not directly related to a North Atlantic Oscillation (NAO) modulation; instead, it reveals a significant connection to the more meridional Eurasian pattern (EU). The magnitude of mean seasonal temperature changes over the European land areas locally exceeds 1 K in the lower troposphere over a sunspot cycle. We also analyse surface data to address the question of whether the solar signal over Europe is temporally stable over a longer 250 yr period. The results increase our confidence in the existence of an influence of the 11 yr cycle on the European climate, but the signal is much weaker in the first half of the period compared to the second half. The last solar minimum (2005 to 2010), which was not included in our analysis, shows anomalies that are consistent with our statistical results for earlier solar minima.
Abstract:
Dynamic changes in ERP topographies can be conveniently analyzed by means of microstates, the so-called "atoms of thought", which represent brief periods of quasi-stable synchronized network activation. Comparing temporal microstate features such as onset, offset, or duration between groups and conditions therefore allows a precise assessment of the timing of cognitive processes. So far, this has been achieved by assigning the individual time-varying ERP maps to spatially defined microstate templates obtained from clustering the grand-mean data into predetermined numbers of topographies (microstate prototypes). Features obtained from these individual assignments were then statistically compared. This approach has the problem that individual noise dilutes the match between individual topographies and templates, leading to lower statistical power. We therefore propose a randomization-based procedure that works without assigning grand-mean microstate prototypes to individual data. In addition, we propose a new criterion to select the optimal number of microstate prototypes based on cross-validation across subjects. After a formal introduction, the method is applied to a sample data set from an N400 experiment and to simulated data with varying signal-to-noise ratios, and the results are compared to existing methods. In a first comparison with previously employed statistical procedures, the new method showed increased robustness to noise and higher sensitivity to more subtle effects of microstate timing. We conclude that the proposed method is well suited for the assessment of timing differences in cognitive processes. The increased statistical power allows identifying more subtle effects, which is particularly important in small and scarce patient populations.
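For context, the following sketch illustrates the conventional template-assignment step the abstract refers to: cluster grand-mean ERP maps into a fixed number of prototypes and label each individual time point by the best-matching prototype using polarity-invariant spatial correlation. Plain k-means and simulated data are simplifying assumptions here, and this does not implement the randomization-based procedure proposed in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_prototypes(grand_mean, n_prototypes):
    """grand_mean: (time points x channels) average ERP; returns (n_prototypes x channels)."""
    maps = grand_mean - grand_mean.mean(axis=1, keepdims=True)   # average-reference each map
    km = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0).fit(maps)
    return km.cluster_centers_

def assign_microstates(erp, prototypes):
    """Label each time point of an individual ERP (time x channels) with the prototype
    having the highest absolute spatial correlation (polarity-invariant)."""
    e = erp - erp.mean(axis=1, keepdims=True)
    p = prototypes - prototypes.mean(axis=1, keepdims=True)
    e /= np.linalg.norm(e, axis=1, keepdims=True)
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    corr = e @ p.T                                   # (time x prototypes) spatial correlations
    return np.argmax(np.abs(corr), axis=1)

# Toy example: 64 channels, 300 time points, 4 microstate prototypes.
rng = np.random.default_rng(0)
grand_mean = rng.normal(size=(300, 64))
labels = assign_microstates(rng.normal(size=(300, 64)), fit_prototypes(grand_mean, 4))
print(np.bincount(labels))   # how many time points each prototype covers
```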
Abstract:
Persons with Down syndrome (DS) uniquely have an increased frequency of leukemias but a decreased total frequency of solid tumors. The distribution and frequency of specific types of brain tumors have never been studied in DS. We evaluated the frequency of primary neural cell embryonal tumors and gliomas in a large international data set. The observed number of children with DS having a medulloblastoma, central nervous system primitive neuroectodermal tumor (CNS-PNET) or glial tumor was compared to the expected number. Data were collected from cancer registries or brain tumor registries in 13 countries of Europe, America, Asia and Oceania. The number of DS children with each category of tumor was treated as a Poisson variable with mean equal to 0.000884 times the total number of registrations in that category. Among 8,043 neural cell embryonal tumors (6,882 medulloblastomas and 1,161 CNS-PNETs), only one patient with medulloblastoma had DS, while 7.11 children in total and 6.08 with medulloblastoma were expected to have DS (p = 0.016 and 0.0066, respectively). Among 13,797 children with glioma, 10 had DS, whereas 12.2 were expected. Children with DS appear to be specifically protected against primary neural cell embryonal tumors of the CNS, whereas gliomas occur at the same frequency as in the general population. A similar protection against neuroblastoma, the principal extracranial neural cell embryonal tumor, has been observed in children with DS. Additional genetic material on the supernumerary chromosome 21 may protect against embryonal neural cell tumor development.
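The expected counts above follow directly from the stated Poisson assumption, and the short script below reproduces the arithmetic. The 0.000884 proportion and the registration totals are taken from the abstract; treating the reported significance levels as lower Poisson tail probabilities P(X ≤ observed) is an assumption made here for illustration.

```python
from scipy.stats import poisson

ds_rate = 0.000884   # assumed proportion of DS children among registrations (from the abstract)
categories = {
    "all neural cell embryonal tumors": (8043, 1),   # (registrations, observed DS cases)
    "medulloblastoma": (6882, 1),
    "glioma": (13797, 10),
}
for name, (n_reg, observed) in categories.items():
    expected = ds_rate * n_reg                       # Poisson mean for this category
    p_lower = poisson.cdf(observed, expected)        # probability of seeing this few or fewer
    print(f"{name}: expected {expected:.2f}, observed {observed}, "
          f"P(X<={observed}) = {p_lower:.4f}")
```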
Abstract:
Historical, i.e. pre-1957, upper-air data are a valuable source of information on the state of the atmosphere, in some parts of the world dating back to the early 20th century. However, to date, reanalyses have only partially made use of these data, and only of observations made after 1948. Even for the period between 1948 (the starting year of the NCEP/NCAR (National Centers for Environmental Prediction/National Center for Atmospheric Research) reanalysis) and the International Geophysical Year in 1957 (the starting year of the ERA-40 reanalysis), when the global upper-air coverage reached more or less its current status, many observations have not yet been digitised. The Comprehensive Historical Upper-Air Network (CHUAN) has already compiled a large collection of pre-1957 upper-air data. In the framework of the European project ERA-CLIM (European Reanalysis of Global Climate Observations), significant amounts of additional upper-air data have been catalogued (> 1.3 million station days), imaged (> 200 000 images) and digitised (> 700 000 station days) in order to prepare a new input data set for upcoming reanalyses. The records cover large parts of the globe, focusing on regions that have so far been less well covered, such as the tropics, the polar regions and the oceans, and on very early upper-air data from Europe and the US. The total number of digitised/inventoried records is 61/101 for moving upper-air data, i.e. data from ships, etc., and 735/1783 for fixed upper-air stations. Here, we give a detailed description of the resulting data set, including the metadata and the quality checking procedures applied. The data will be included in the next version of CHUAN. The data are available at doi:10.1594/PANGAEA.821222.
Abstract:
Brain tumor is one of the most aggressive types of cancer in humans, with an estimated median survival time of 12 months and only 4% of patients surviving more than 5 years after disease diagnosis. Until recently, brain tumor prognosis has been based only on clinical information such as tumor grade and patient age, but there are reports indicating that molecular profiling of gliomas can reveal subgroups of patients with distinct survival rates. We hypothesize that coupling molecular profiling of brain tumors with clinical information might improve predictions of patient survival time and, consequently, better guide future treatment decisions. In order to evaluate this hypothesis, the general goal of this research is to build models for survival prediction of glioma patients using DNA molecular profiles (U133 Affymetrix gene expression microarrays) along with clinical information. First, a predictive Random Forest model is built for binary outcomes (i.e. short- vs. long-term survival) and a small subset of genes whose expression values can be used to predict survival time is selected. Next, a new statistical methodology is developed for predicting time-to-death outcomes using Bayesian ensemble trees. Due to the large heterogeneity observed within the prognostic classes obtained by the Random Forest model, prediction can be improved by relating time-to-death to the gene expression profile directly. We propose a Bayesian ensemble model for survival prediction which is appropriate for high-dimensional data such as gene expression data. Our approach is based on the ensemble "sum-of-trees" model, which is flexible enough to incorporate additive and interaction effects between genes. We specify a fully Bayesian hierarchical approach and illustrate our methodology for the CPH, Weibull, and AFT survival models. We overcome the lack of conjugacy using a latent variable formulation to model the covariate effects, which decreases the computation time for model fitting. In addition, our proposed models provide a model-free way to select important predictive prognostic markers based on controlling false discovery rates. We compare the performance of our methods with baseline reference survival methods and apply our methodology to an unpublished data set of brain tumor survival times and gene expression data, selecting genes potentially related to the development of the disease under study. A closing discussion compares the results obtained by the Random Forest and Bayesian ensemble methods from biological/clinical perspectives and highlights the statistical advantages and disadvantages of the new methodology in the context of DNA microarray data analysis.
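As an illustration of the first modeling step (a Random Forest for binary short- vs. long-term survival with gene selection via variable importance), here is a minimal sketch on simulated data. The features, sample sizes, and the use of impurity-based importances are illustrative assumptions only; the actual study uses U133 Affymetrix expression profiles, and the Bayesian ensemble methodology is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_patients, n_genes = 120, 2000

expr = rng.normal(size=(n_patients, n_genes))          # simulated gene expression matrix
age = rng.normal(55, 12, size=(n_patients, 1))         # clinical covariate: patient age
grade = rng.integers(2, 5, size=(n_patients, 1))       # clinical covariate: tumor grade
X = np.hstack([expr, age, grade])

# Simulated outcome: a handful of genes plus age drive short-term (1) vs. long-term (0) survival.
risk = expr[:, :5].sum(axis=1) + 0.03 * (age[:, 0] - 55)
y = (risk > np.median(risk)).astype(int)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
print("CV accuracy:", cross_val_score(rf, X, y, cv=5).mean())

rf.fit(X, y)
gene_importance = rf.feature_importances_[:n_genes]    # importances for the gene features only
top_genes = np.argsort(gene_importance)[::-1][:20]     # shortlist of candidate prognostic genes
print("Top-ranked gene indices:", top_genes[:10])
```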
Abstract:
The development of northern high-latitude peatlands played an important role in the carbon (C) balance of the land biosphere since the Last Glacial Maximum (LGM). At present, carbon storage in northern peatlands is substantial and estimated to be 500 ± 100 Pg C (1 Pg C = 1015 g C). Here, we develop and apply a peatland module embedded in a dynamic global vegetation and land surface process model (LPX-Bern 1.0). The peatland module features a dynamic nitrogen cycle, a dynamic C transfer between peatland acrotelm (upper oxic layer) and catotelm (deep anoxic layer), hydrology- and temperature-dependent respiration rates, and peatland specific plant functional types. Nitrogen limitation down-regulates average modern net primary productivity over peatlands by about half. Decadal acrotelm-to-catotelm C fluxes vary between −20 and +50 g C m−2 yr−1 over the Holocene. Key model parameters are calibrated with reconstructed peat accumulation rates from peat-core data. The model reproduces the major features of the peat core data and of the observation-based modern circumpolar soil carbon distribution. Results from a set of simulations for possible evolutions of northern peat development and areal extent show that soil C stocks in modern peatlands increased by 365–550 Pg C since the LGM, of which 175–272 Pg C accumulated between 11 and 5 kyr BP. Furthermore, our simulations suggest a persistent C sequestration rate of 35–50 Pg C per 1000 yr in present-day peatlands under current climate conditions, and that this C sink could either sustain or turn towards a source by 2100 AD depending on climate trajectories as projected for different representative greenhouse gas concentration pathways.
Abstract:
In this study we report on new non-sea salt calcium (nssCa2+, mineral dust proxy) and sea salt sodium (ssNa+, sea ice proxy) records along the East Antarctic Talos Dome deep ice core at centennial resolution, reaching back 150 thousand years (ka) before present. During glacial conditions nssCa2+ fluxes in Talos Dome are strongly related to temperature, as has been observed before in other deep Antarctic ice core records, and have been associated with synchronous changes in the main source region (southern South America) during climate variations in the last glacial. However, during warmer climate conditions Talos Dome mineral dust input is clearly elevated compared to other records, mainly due to the contribution of additional local dust sources in the Ross Sea area. Based on a simple transport model, we compare nssCa2+ fluxes of different East Antarctic ice cores. From this multi-site comparison we conclude that changes in transport efficiency or atmospheric lifetime of dust particles have only a minor effect, compared to source strength changes, on the large-scale concentration changes observed in Antarctic ice cores during climate variations of the past 150 ka. Our transport model applied to ice core data is further validated by climate model data. The availability of multiple East Antarctic nssCa2+ records also allows for a revision of a former estimate of the atmospheric CO2 sensitivity to reduced dust-induced iron fertilisation in the Southern Ocean during the transition from the Last Glacial Maximum to the Holocene (T1). While a former estimate based on the EPICA Dome C (EDC) record suggested only 20 ppm, we find that reduced dust-induced iron fertilisation in the Southern Ocean may be responsible for up to 40 ppm of the total atmospheric CO2 increase during T1. During the last interglacial, ssNa+ levels of EDC and EPICA Dronning Maud Land (EDML) are only half of the Holocene levels, in line with higher temperatures during that period, indicating much reduced sea ice extent in the Atlantic as well as the Indian Ocean sector of the Southern Ocean. In contrast, the Holocene ssNa+ flux in Talos Dome is about the same as during the last interglacial, indicating that ice cover in the Ross Sea area during MIS 5.5 was similar to that during the Holocene.