960 results for data sets


Relevance:

70.00%

Abstract:

We investigate adaptive buffer management techniques for approximate evaluation of sliding window joins over multiple data streams. In many applications, data stream processing systems have limited memory or have to deal with very high-speed data streams. In both cases, computing the exact results of joins between these streams may not be feasible, mainly because the buffers used to compute the joins hold far fewer tuples than the sliding windows contain. A stream buffer management policy is therefore needed. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ), an adaptive and locality-aware buffering technique for managing these buffers. GDJ exploits the temporal correlations (at both long and short time scales), which we found to be prevalent in many real data streams. We note that our algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report results of an experimental study using both synthetic and real-world data sets. Our results demonstrate the superiority and flexibility of our approach when contrasted to other recently proposed techniques.

Relevance:

70.00%

Abstract:

The data streaming model provides an attractive framework for one-pass summarization of massive data sets at a single observation point. However, in an environment where multiple data streams arrive at a set of distributed observation points, sketches must be computed remotely and then must be aggregated through a hierarchy before queries may be conducted. As a result, many sketch-based methods for the single stream case do not apply directly, as either the error introduced becomes large, or because the methods assume that the streams are non-overlapping. These limitations hinder the application of these techniques to practical problems in network traffic monitoring and aggregation in sensor networks. To address this, we develop a general framework for evaluating and enabling robust computation of duplicate-sensitive aggregate functions (e.g., SUM and QUANTILE), over data produced by distributed sources. We instantiate our approach by augmenting the Count-Min and Quantile-Digest sketches to apply in this distributed setting, and analyze their performance. We conclude with experimental evaluation to validate our analysis.
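The limitation described in this abstract can be illustrated with a plain Count-Min sketch. The following is a minimal sketch of the standard data structure, not the augmented version the authors develop; the class name, the SHA-256-based row hashing, and the `merge` method are assumptions made for illustration. Merging by counter-wise addition is only correct when the streams do not overlap: an item observed at two sites is counted twice.

```python
import hashlib


class CountMinSketch:
    """Standard Count-Min sketch: `depth` rows of `width` counters (illustrative)."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash per row, derived by salting a stable digest with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(item, r)] += count

    def estimate(self, item):
        # Collisions can only inflate counters, so the minimum is the tightest bound.
        return min(self.rows[r][self._index(item, r)] for r in range(self.depth))

    def merge(self, other):
        # Counter-wise addition: valid only if the two streams share no observations.
        assert (self.width, self.depth) == (other.width, other.depth)
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Two hypothetical observation points; flow "f2" is seen at both.
site_a, site_b = CountMinSketch(), CountMinSketch()
for flow in ["f1", "f2", "f2"]:
    site_a.add(flow)
for flow in ["f2", "f3"]:
    site_b.add(flow)
combined = site_a.merge(site_b)
print(combined.estimate("f2"))  # an upper bound on the summed count of "f2"
```

If the two sites had observed the same physical event, the merged estimate would count it once per site, which is the overlap problem the abstract refers to.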

Relevance:

70.00%

Abstract:

As more diagnostic testing options become available to physicians, it becomes more difficult to combine various types of medical information in order to optimize the overall diagnosis. To improve diagnostic performance, here we introduce an approach to optimize a decision-fusion technique to combine heterogeneous information, such as from different modalities, feature categories, or institutions. For classifier comparison we used two performance metrics: the area under the receiver operating characteristic (ROC) curve (AUC) and the normalized partial area under the curve (pAUC). This study used four classifiers: linear discriminant analysis (LDA), artificial neural network (ANN), and two variants of our decision-fusion technique, AUC-optimized (DF-A) and pAUC-optimized (DF-P) decision fusion. We applied each of these classifiers with 100-fold cross-validation to two heterogeneous breast cancer data sets: one of mass lesion features and a much more challenging one of microcalcification lesion features. For the calcification data set, DF-A outperformed the other classifiers in terms of AUC (p < 0.02) and achieved AUC=0.85 +/- 0.01. The DF-P surpassed the other classifiers in terms of pAUC (p < 0.01) and reached pAUC=0.38 +/- 0.02. For the mass data set, DF-A outperformed both the ANN and the LDA (p < 0.04) and achieved AUC=0.94 +/- 0.01. Although for this data set there were no statistically significant differences among the classifiers' pAUC values (pAUC=0.57 +/- 0.07 to 0.67 +/- 0.05, p > 0.10), the DF-P did significantly improve specificity versus the LDA at both 98% and 100% sensitivity (p < 0.04). In conclusion, decision fusion directly optimized clinically significant performance measures, such as AUC and pAUC, and sometimes outperformed two well-known machine-learning techniques when applied to two different breast cancer data sets.
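For readers unfamiliar with the AUC metric used above, the following is a minimal sketch of one common way to compute it: as the fraction of (positive, negative) case pairs that the classifier ranks correctly, with ties counted as one half. The function name and the toy scores are hypothetical, and the sketch does not cover the normalized pAUC or the decision-fusion optimization described in the abstract.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case scored above negative case
            elif p == n:
                wins += 0.5   # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical classifier outputs: three malignant (1) and three benign (0) cases.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
print(auc)  # 8 of the 9 pairs are ranked correctly
```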

Relevance:

70.00%

Abstract:

BACKGROUND: Determining the evolutionary relationships among the major lineages of extant birds has been one of the biggest challenges in systematic biology. To address this challenge, we assembled or collected the genomes of 48 avian species spanning most orders of birds, including all Neognathae and two of the five Palaeognathae orders. We used these genomes to construct a genome-scale avian phylogenetic tree and perform comparative genomic analyses. FINDINGS: Here we present the datasets associated with the phylogenomic analyses, which include sequence alignment files consisting of nucleotides, amino acids, indels, and transposable elements, as well as tree files containing gene trees and species trees. Inferring an accurate phylogeny required generating: 1) A well annotated data set across species based on genome synteny; 2) Alignments with unaligned or incorrectly overaligned sequences filtered out; and 3) Diverse data sets, including genes and their inferred trees, indels, and transposable elements. Our total evidence nucleotide tree (TENT) data set (consisting of exons, introns, and UCEs) gave what we consider our most reliable species tree when using the concatenation-based ExaML algorithm or when using statistical binning with the coalescence-based MP-EST algorithm (which we refer to as MP-EST*). Other data sets, such as the coding sequence of some exons, revealed other properties of genome evolution, namely convergence. CONCLUSIONS: The Avian Phylogenomics Project is the largest vertebrate phylogenomics project to date that we are aware of. The sequence, alignment, and tree data are expected to accelerate analyses in phylogenomics and other related areas.

Relevance:

70.00%

Abstract:

PURPOSE: X-ray computed tomography (CT) is widely used, both clinically and preclinically, for fast, high-resolution anatomic imaging; however, compelling opportunities exist to expand its use in functional imaging applications. For instance, spectral information combined with nanoparticle contrast agents enables quantification of tissue perfusion levels, while temporal information details cardiac and respiratory dynamics. The authors propose and demonstrate a projection acquisition and reconstruction strategy for 5D CT (3D+dual energy+time) which recovers spectral and temporal information without substantially increasing radiation dose or sampling time relative to anatomic imaging protocols. METHODS: The authors approach the 5D reconstruction problem within the framework of low-rank and sparse matrix decomposition. Unlike previous work on rank-sparsity constrained CT reconstruction, the authors establish an explicit rank-sparse signal model to describe the spectral and temporal dimensions. The spectral dimension is represented as a well-sampled time and energy averaged image plus regularly undersampled principal components describing the spectral contrast. The temporal dimension is represented as the same time and energy averaged reconstruction plus contiguous, spatially sparse, and irregularly sampled temporal contrast images. Using a nonlinear, image-domain filtration approach that the authors refer to as rank-sparse kernel regression, the authors transfer image structure from the well-sampled time and energy averaged reconstruction to the spectral and temporal contrast images. This regularization strategy strictly constrains the reconstruction problem while approximately separating the temporal and spectral dimensions. Separability results in a highly compressed representation for the 5D data in which projections are shared between the temporal and spectral reconstruction subproblems, enabling substantial undersampling. The authors solved the 5D reconstruction problem using the split Bregman method and GPU-based implementations of backprojection, reprojection, and kernel regression. Using a preclinical mouse model, the authors apply the proposed algorithm to study myocardial injury following radiation treatment of breast cancer. RESULTS: Quantitative 5D simulations are performed using the MOBY mouse phantom. Twenty data sets (ten cardiac phases, two energies) are reconstructed with 88 μm, isotropic voxels from 450 total projections acquired over a single 360° rotation. In vivo 5D myocardial injury data sets acquired in two mice injected with gold and iodine nanoparticles are also reconstructed with 20 data sets per mouse using the same acquisition parameters (dose: ∼60 mGy). For both the simulations and the in vivo data, the reconstruction quality is sufficient to perform material decomposition into gold and iodine maps to localize the extent of myocardial injury (gold accumulation) and to measure cardiac functional metrics (vascular iodine). The authors' 5D CT imaging protocol represents a 95% reduction in radiation dose per cardiac phase and energy and a 40-fold decrease in projection sampling time relative to their standard imaging protocol. CONCLUSIONS: The authors' 5D CT data acquisition and reconstruction protocol efficiently exploits the rank-sparse nature of spectral and temporal CT data to provide high-fidelity reconstruction results without increased radiation dose or sampling time.

Relevance:

70.00%

Abstract:

The SB distributional model of Johnson's 1949 paper was introduced by a transformation to normality, that is, z ~ N(0, 1), consisting of a linear scaling to the range (0, 1), a logit transformation, and an affine transformation, z = γ + δu. The model, in its original parameterization, has often been used in forest diameter distribution modelling. In this paper, we define the SB distribution in terms of the inverse transformation from normality, including an initial linear scaling transformation, u = γ′ + δ′z (δ′ = 1/δ and γ′ = −γ/δ). The SB model in terms of the new parameterization is derived, and maximum likelihood estimation schema are presented for both model parameterizations. The statistical properties of the two alternative parameterizations are compared empirically on 20 data sets of diameter distributions of Changbai larch (Larix olgensis Henry). The new parameterization is shown to be statistically better than Johnson's original parameterization for the data sets considered here.
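The two directions of the transformation can be sketched as follows. Solving z = γ + δu for u gives u = −γ/δ + z/δ, which is where δ′ = 1/δ and γ′ = −γ/δ come from. The location and range parameters `xi` and `lam` used for the linear scaling, the function names, and the numeric values are assumptions for illustration; the abstract does not name them, and the sketch does not include the maximum likelihood estimation.

```python
import math


def sb_to_normal(x, gamma, delta, xi, lam):
    """Original direction: map an SB-distributed x on (xi, xi + lam) to z."""
    y = (x - xi) / lam             # linear scaling to the range (0, 1)
    u = math.log(y / (1.0 - y))    # logit transformation
    return gamma + delta * u       # affine transformation z = gamma + delta * u


def normal_to_sb(z, gamma, delta, xi, lam):
    """Inverse direction, written with gamma' = -gamma/delta and delta' = 1/delta."""
    gamma_p = -gamma / delta
    delta_p = 1.0 / delta
    u = gamma_p + delta_p * z          # initial linear scaling of z
    y = 1.0 / (1.0 + math.exp(-u))     # inverse logit, back to (0, 1)
    return xi + lam * y                # rescale to the original range


# Hypothetical tree diameter (cm) and parameter values; the round trip recovers x.
x = 18.0
z = sb_to_normal(x, gamma=0.5, delta=1.2, xi=5.0, lam=40.0)
x_back = normal_to_sb(z, gamma=0.5, delta=1.2, xi=5.0, lam=40.0)
print(z, x_back)
```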

Relevance:

70.00%

Abstract:

In this paper, the buildingEXODUS evacuation model is described and discussed and attempts at qualitative and quantitative model validation are presented. The data sets used for validation are the Stapelfeldt and Milburn House evacuation data. As part of the validation exercise, the sensitivity of the buildingEXODUS predictions to a range of variables is examined, including occupant drive, occupant location, exit flow capacity, exit size, occupant response times and geometry definition. An important consideration that has been highlighted by this work is that any validation exercise must be scrutinised to identify both the results generated and the considerations and assumptions on which they are based. During the course of the validation exercise, both data sets were found to be less than ideal for the purpose of validating complex evacuation models. However, the buildingEXODUS evacuation model was found to be able to produce reasonable qualitative and quantitative agreement with the experimental data.

Relevance:

70.00%

Abstract:

This document is the first of three iterations of the data management plan (DMP) that will be formally delivered during the project. Version 2 is due in month 24 and version 3 towards the end of the project. The DMP is thus not a fixed document; it evolves and gains more precision and substance during the lifespan of the project. In this first version we describe the planned research data sets related to the RAGE evaluation and validation activities, and the fifteen principles that will guide data management in RAGE. The former are described in the format of the EU data management template, and the latter in terms of their guiding principle, how we propose to implement them, and when they will be implemented. This document is thus first of all relevant to WP5 and WP8 members.

Relevance:

70.00%

Abstract:

A robust method for fitting to the results of gel electrophoresis assays of damage to plasmid DNA caused by radiation is presented. This method makes use of nonlinear regression to fit analytically derived dose response curves to observations of the supercoiled, open circular and linear plasmid forms simultaneously, allowing for more accurate results than fitting to individual forms. Comparisons with a commonly used analysis method show that while there is a relatively small difference between the methods for data sets with small errors, the parameters generated by this method remain much more closely distributed around the true value in the face of increasing measurement uncertainties. This allows for parameters to be specified with greater confidence, reflected in a reduction of errors on fitted parameters. On test data sets, fitted uncertainties were reduced by 30%, similar to the improvement that would be offered by moving from triplicate to fivefold repeats (assuming standard errors). This method has been implemented in a popular spreadsheet package and made available online to improve its accessibility. (C) 2011 by Radiation Research Society

Relevance:

70.00%

Abstract:

Recent years have witnessed rapidly growing interest in the topic of incremental learning. Unlike conventional machine learning situations, the data flow targeted by incremental learning becomes available continuously over time. Accordingly, it is desirable to be able to abandon the traditional assumption that representative training data are available during the training period to develop decision boundaries. Under scenarios of continuous data flow, the challenge is how to transform the vast amount of raw stream data into information and knowledge representation, and accumulate experience over time to support future decision-making processes. In this paper, we propose a general adaptive incremental learning framework named ADAIN that is capable of learning from continuous raw data, accumulating experience over time, and using such knowledge to improve future learning and prediction performance. Detailed system level architecture and design strategies are presented in this paper. Simulation results over several real-world data sets are used to validate the effectiveness of this method.

Relevance:

70.00%

Abstract:

Many of the most interesting questions ecologists ask lead to analyses of spatial data. Yet, perhaps confused by the large number of statistical models and fitting methods available, many ecologists seem to believe this is best left to specialists. Here, we describe the issues that need consideration when analysing spatial data and illustrate these using simulation studies. Our comparative analysis involves using methods including generalized least squares, spatial filters, wavelet revised models, conditional autoregressive models and generalized additive mixed models to estimate regression coefficients from synthetic but realistic data sets, including some which violate standard regression assumptions. We assess the performance of each method using two measures and using statistical error rates for model selection. Methods that performed well included the generalized least squares family of models and a Bayesian implementation of the conditional autoregressive model. Ordinary least squares also performed adequately in the absence of model selection, but had poorly controlled Type I error rates and so did not show the improvements in performance under model selection when using the above methods. Removing large-scale spatial trends in the response led to poor performance. These are empirical results; hence extrapolation of these findings to other situations should be performed cautiously. Nevertheless, our simulation-based approach provides much stronger evidence for comparative analysis than assessments based on single or small numbers of data sets, and should be considered a necessary foundation for statements of this type in future.

Relevance:

70.00%

Abstract:

1) Executive Summary
Legislation (Autism Act NI, 2011), a cross-departmental strategy (Autism Strategy 2013-2020) and a first action plan (2013-2016) have been developed in Northern Ireland to support individuals and families affected by Autism Spectrum Disorder (ASD), but without a prior thorough baseline assessment of need. At the same time, there are large existing data sets about the population in NI that had never been subjected to a secondary data analysis with regard to data on ASD. This report covers the first comprehensive secondary data analysis and thereby aims to inform future policy and practice.
Following a search of all existing, large-scale, regional or national data sets that were relevant to the lives of individuals and families affected by Autism Spectrum Disorder (ASD) in Northern Ireland, extensive secondary data analyses were carried out. The focus of these secondary data analyses was to distill any ASD-related data from larger generic data sets. The findings are reported for each data set and follow a lifespan perspective, i.e., data related to children are reported before data related to adults.
Key findings:
Autism Prevalence:
Of children born in 2000 in the UK,
• 0.9% (1:109) were reported to have ASD when they were 5 years old in 2005;
• 1.8% (1:55) were reported to have ASD when they were 7 years old in 2007;
• 3.5% (1:29) were reported to have ASD when they were 11 years old in 2011.
In mainstream schools in Northern Ireland
• 1.2% of the children were reported to have ASD in 2006/07;
• 1.8% of the children were reported to have ASD in 2012/13.

Economic Deprivation:
• Families of children with autism (CWA) were 9%-18% worse off per week than families of children not on the autism spectrum (COA).
• Between 2006 and 2013, deprivation of CWA compared to COA nearly doubled, as measured by eligibility for free school meals (from nearly 20% to 37%).
• In 2006, CWA and COA experienced similar levels of deprivation (approx. 20%); by 2013, a considerable deprivation gap had developed, with CWA experiencing 6% more deprivation than COA.
• Nearly 1/3 of primary school CWA lived in the most deprived areas in Northern Ireland.
• Nearly ½ of children with Asperger’s Syndrome who attended special school lived in the most deprived areas.

Unemployment:
• Mothers of CWA were 6% less likely to be employed than mothers of COA.
• Mothers of CWA earned 35%-56% less than mothers of COA.
• CWA were 9% less likely to live in two income families than COA.

Health:
• Pre-diagnosis, CWA were more likely than COA to have physical health problems, including walking on level ground, speech and language, hearing, eyesight, and asthma.
• At 3 years of age, CWA experienced poorer emotional and social health than COA; this difference increased significantly by the time they were 7 years of age.
• Mothers of young CWA had lower levels of life satisfaction and poorer mental health than mothers of young COA.

Education:
• In mainstream education, children with ASD aged 11-16 years reported less satisfaction with their social relationships than COA.
• Younger children with ASD (aged 5 and 7 years) were less likely to enjoy school, were bullied more, and were more reluctant to attend school than COA.
• CWA attended school 2-3 weeks less than COA.
• Children with Asperger’s Syndrome in special schools missed the equivalent of 8-13 school days more than children with Asperger’s Syndrome in mainstream schools.
• Children with ASD attending mainstream schooling were less likely to gain 5+ GCSEs A*-C or subsequently attend university.



Further and Higher Education:
• Enrolment rates for students with ASD have risen in Further Education (FE), from 0% to 0.7%.
• Enrolment rates for students with ASD have risen in Higher Education (HE), from 0.28% to 0.45%.
• Students with ASD chose to study different subjects than students without ASD, although other factors, e.g., gender, age etc. may have played a part in subject selection.
• Students with ASD from NI were more likely than students without ASD to choose Northern Irish HE Institutions rather than study outside NI.

Participation in adult life and employment:
• A small number of adults with ASD (n=99) have benefitted from DES employment provision over the past 12 years.
• It is unknown how many adults with ASD have received employment support elsewhere (e.g. Steps to Work).

Awareness and Attitudes in the General Population:
• In both the 2003 and 2012 NI Life and Times Survey (NILTS), NI public reported positive attitudes towards the inclusion of children with ASD in mainstream education (see also BASE Project Vol. 2).

Gap Analysis Recommendations:
This was the first comprehensive secondary analysis with regard to ASD of existing large-scale data sets in Northern Ireland. Data gaps were identified, and further replications would benefit from the following data inclusion:
• ASD should be recorded routinely in the following datasets:
o Census;
o Northern Ireland Survey of Activity Limitation (NISALD);
o Training for Success/Steps to Work; Steps to Success;
o Travel survey;
o Hate crime; and
o Labour Force Survey.
• Data should be collected on the destinations/qualifications of special school leavers.
• NILT Survey autism module should be repeated in 5 years' time (2017) (see full report of 1st NILT Survey autism module 2012 in BASE Project Report Volume 2).
• General public attitudes and awareness should be assessed for children and young people, using the Young Life and Times Survey (YLT) and the Kids Life and Times Survey (KLT); (this work is underway, Dillenburger, McKerr, Schubolz, & Lloyd, 2014-2015).

Relevance:

70.00%

Abstract:

In the study of complex genetic diseases, the identification of subgroups of patients sharing similar genetic characteristics represents a challenging task, for example, to improve treatment decisions. One type of genetic lesion, frequently investigated in such disorders, is the change of the DNA copy number (CN) at specific genomic traits. Non-negative Matrix Factorization (NMF) is a standard technique to reduce the dimensionality of a data set and to cluster data samples, while keeping its most relevant information in meaningful components. Thus, it can be used to discover subgroups of patients from CN profiles. It is, however, computationally impractical for very high dimensional data, such as CN microarray data. Deciding the most suitable number of subgroups is also a challenging problem. The aim of this work is to derive a procedure to compact high dimensional data, in order to improve NMF applicability without compromising the quality of the clustering. This is particularly important for analyzing high-resolution microarray data. Many commonly used quality measures, as well as our own measures, are employed to decide the number of subgroups and to assess the quality of the results. Our measures are based on the idea of identifying robust subgroups, inspired by biological/clinical relevance instead of simply aiming at well-separated clusters. We evaluate our procedure using four real independent data sets. In these data sets, our method was able to find accurate subgroups with individual molecular and clinical features and outperformed the standard NMF in terms of accuracy in the factorization fitness function. Hence, it can be useful for the discovery of subgroups of patients with similar CN profiles in the study of heterogeneous diseases.
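As background for the abstract above, the following is a minimal sketch of standard NMF used for clustering, written with the classical multiplicative update rules in plain Python. It is not the authors' compaction procedure or their quality measures. The matrix orientation (rows are features, columns are samples), the function names, the toy matrix, and the rule of assigning each sample to its largest component are all assumptions for illustration.

```python
import random


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (features x samples) into non-negative W (features x rank)
    and H (rank x samples) using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    start = frob_error(v, w, h)
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h, start, frob_error(v, w, h)


# Toy non-negative profile matrix: 4 features (rows) x 4 samples (columns).
v = [[5, 4, 0, 0],
     [4, 5, 0, 0],
     [0, 0, 3, 4],
     [0, 0, 4, 3]]
w, h, start_err, end_err = nmf(v, rank=2)
# Assign each sample (column) to the component with the largest coefficient in H.
clusters = [max(range(2), key=lambda k: h[k][j]) for j in range(4)]
print(start_err, end_err, clusters)
```

Each iteration multiplies full matrices, so the cost grows with the number of features, which gives a sense of why very high dimensional inputs are expensive for this kind of factorization.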

Relevance:

70.00%

Abstract:

BACKGROUND: Urothelial pathogenesis is a complex process driven by an underlying network of interconnected genes. The identification of novel genomic target regions and gene targets that drive urothelial carcinogenesis is crucial in order to improve our current limited understanding of urothelial cancer (UC) on the molecular level. The inference of genome-wide gene regulatory networks (GRN) from large-scale gene expression data provides a promising approach for a detailed investigation of the underlying network structure associated with urothelial carcinogenesis.

METHODS: In our study we inferred and compared three GRNs by the application of the BC3Net inference algorithm to large-scale transitional cell carcinoma gene expression data sets from Illumina RNAseq (179 samples), Illumina Bead arrays (165 samples) and Affymetrix Oligo microarrays (188 samples). We investigated the structural and functional properties of GRNs for the identification of molecular targets associated with urothelial cancer.

RESULTS: We found that the urothelial cancer (UC) GRNs show a significant enrichment of subnetworks that are associated with known cancer hallmarks including cell cycle, immune response, signaling, differentiation and translation. Interestingly, the most prominent subnetworks of co-located genes were found on chromosome regions 5q31.3 (RNAseq), 8q24.3 (Oligo) and 1q23.3 (Bead), which all represent known genomic regions frequently deregulated or aberrant in urothelial cancer and other cancer types. Furthermore, the identified hub genes of the individual GRNs, e.g., HID1/DMC1 (tumor development), RNF17/TDRD4 (cancer antigen) and CYP4A11 (angiogenesis/metastasis), are known cancer-associated markers. The GRNs were highly dataset specific on the interaction level between individual genes, but showed large similarities on the biological function level represented by subnetworks. Remarkably, the RNAseq UC GRN showed twice the proportion of significant functional subnetworks. Based on our analysis of inferential and experimental networks, the Bead UC GRN showed the lowest performance compared to the RNAseq and Oligo UC GRNs.

CONCLUSION: To our knowledge, this is the first study investigating genome-scale UC GRNs. RNAseq based gene expression data is the data platform of choice for a GRN inference. Our study offers new avenues for the identification of novel putative diagnostic targets for subsequent studies in bladder tumors.

Relevance:

70.00%

Abstract:

The popularity of tri-axial accelerometer data loggers to quantify animal activity through the analysis of signature traces is increasing. However, there is no consensus on how to process the large data sets that these devices generate when recording at the necessary high sample rates. In addition, there have been few attempts to validate accelerometer traces with specific behaviours in non-domesticated terrestrial mammals.