960 results for data sets


Relevance:

100.00%

Publisher:

Abstract:

The stratospheric mean-meridional circulation (MMC) and eddy mixing are compared among six meteorological reanalysis data sets: NCEP-NCAR, NCEP-CFSR, ERA-40, ERA-Interim, JRA-25, and JRA-55 for the period 1979–2012. The reanalysis data sets produced using advanced systems (i.e., NCEP-CFSR, ERA-Interim, and JRA-55) generally reveal a weaker MMC in the Northern Hemisphere (NH) compared with those produced using older systems (i.e., NCEP-NCAR, ERA-40, and JRA-25). The mean mixing strength differs considerably among the data products. In the NH lower stratosphere, the contribution of planetary-scale mixing is larger in the new data sets than in the old data sets, whereas that of small-scale mixing is weaker in the new data sets. Conventional data assimilation techniques introduce analysis increments without maintaining physical balance, which may have caused an overly strong MMC and spurious small-scale eddies in the old data sets. At the NH mid-latitudes, only ERA-Interim reveals a weakening MMC trend in the deep branch of the Brewer–Dobson circulation (BDC). The relative importance of the eddy mixing compared with the mean-meridional transport in the subtropical lower stratosphere shows increasing trends in ERA-Interim and JRA-55; this, together with the weakened MMC in the deep branch, may imply an increasing age-of-air (AoA) in the NH middle stratosphere in ERA-Interim. Overall, discrepancies between the different variables and trends therein as derived from the different reanalyses are still relatively large, suggesting that more investments in these products are needed in order to obtain a consolidated picture of observed changes in the BDC and the mechanisms that drive them.

Relevance:

100.00%

Publisher:

Abstract:

In Information Visualization, adding and removing data elements can strongly impact the underlying visual space. We have developed an inherently incremental technique (incBoard) that maintains a coherent disposition of elements from a dynamic multidimensional data set on a 2D grid as the set changes. Here, we introduce a novel layout that uses pairwise similarity from grid neighbors, as defined in incBoard, to reposition elements on the visual space, free from constraints imposed by the grid. The board continues to be updated and can be displayed alongside the new space. As similar items are placed together, while dissimilar neighbors are moved apart, it supports users in the identification of clusters and subsets of related elements. Densely populated areas identified in the incSpace can be efficiently explored with the corresponding incBoard visualization, which is not susceptible to occlusion. The solution remains inherently incremental and maintains a coherent disposition of elements, even for fully renewed sets. The algorithm considers relative positions for the initial placement of elements, and raw dissimilarity to fine tune the visualization. It has low computational cost, with complexity depending only on the size of the currently viewed subset, V. Thus, a data set of size N can be sequentially displayed in O(N) time, reaching O(N^2) only if the complete set is simultaneously displayed.

Relevance:

100.00%

Publisher:

Abstract:

Most multidimensional projection techniques rely on distance (dissimilarity) information between data instances to embed high-dimensional data into a visual space. When data are endowed with Cartesian coordinates, an extra computational effort is necessary to compute the needed distances, making multidimensional projection prohibitive in applications dealing with interactivity and massive data. The novel multidimensional projection technique proposed in this work, called Part-Linear Multidimensional Projection (PLMP), has been tailored to handle multivariate data represented in Cartesian high-dimensional spaces, requiring only distance information between pairs of representative samples. This characteristic renders PLMP faster than previous methods when processing large data sets while still being competitive in terms of precision. Moreover, knowing the range of variation for data instances in the high-dimensional space, we can make PLMP a truly streaming data projection technique, a trait absent in previous methods.

Relevance:

100.00%

Publisher:

Abstract:

This paper reviews the appropriateness for application to large data sets of standard machine learning algorithms, which were mainly developed in the context of small data sets. Sampling and parallelisation have proved useful means for reducing computation time when learning from large data sets. However, such methods assume that algorithms that were designed for use with what are now considered small data sets are also fundamentally suitable for large data sets. It is plausible that optimal learning from large data sets requires a different type of algorithm to optimal learning from small data sets. This paper investigates one respect in which data set size may affect the requirements of a learning algorithm — the bias plus variance decomposition of classification error. Experiments show that learning from large data sets may be more effective when using an algorithm that places greater emphasis on bias management, rather than variance management.
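
As a rough illustration of what a bias plus variance decomposition of classification error measures, the sketch below estimates both terms from the predictions of several models, each assumed to have been trained on a different training sample. It uses a Domingos-style zero-one-loss definition (main prediction = majority vote), which is an assumption made here for illustration; the abstract does not say which decomposition the paper uses. The function name `bias_variance_01` and the toy data are hypothetical.

```python
from collections import Counter


def bias_variance_01(predictions, true_labels):
    """Estimate average bias and variance under zero-one loss.

    predictions[m][i] is the label predicted for test point i by model m
    (each model is assumed to come from a different training sample).
    Assumed definitions (Domingos-style, noise-free):
      main prediction = most frequent prediction for a point,
      bias            = 1 if the main prediction is wrong, else 0,
      variance        = fraction of models that disagree with the main prediction.
    """
    n_models = len(predictions)
    n_points = len(true_labels)
    bias_sum = 0.0
    variance_sum = 0.0
    for i in range(n_points):
        votes = [predictions[m][i] for m in range(n_models)]
        main = Counter(votes).most_common(1)[0][0]
        bias_sum += 1.0 if main != true_labels[i] else 0.0
        variance_sum += sum(v != main for v in votes) / n_models
    return bias_sum / n_points, variance_sum / n_points


# Toy example: three models, four test points (hypothetical data).
preds = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
]
truth = [0, 1, 1, 1]
bias, variance = bias_variance_01(preds, truth)
print(f"bias={bias:.3f} variance={variance:.3f}")
```

Under these definitions, an algorithm with low bias but high variance disagrees with itself across training samples, whereas a high-bias algorithm is consistently wrong on the same points; the abstract's claim concerns which of the two terms matters more as the training set grows.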

Relevance:

100.00%

Publisher:

Abstract:

Missing data imputation is a key issue in learning from incomplete data. Various techniques have been developed with great success for dealing with missing values in data sets with homogeneous attributes (their independent attributes are all either continuous or discrete). This paper studies a new setting of missing data imputation, i.e., imputing missing data in data sets with heterogeneous attributes (their independent attributes are of different types), referred to as imputing mixed-attribute data sets. Although many real applications are in this setting, there is no estimator designed for imputing mixed-attribute data sets. This paper first proposes two consistent estimators for discrete and continuous missing target values, respectively. Then, a mixture-kernel-based iterative estimator is advocated to impute mixed-attribute data sets. The proposed method is evaluated in extensive experiments against some typical algorithms, and the results demonstrate that the proposed approach is better than these existing imputation methods in terms of classification accuracy and root mean square error (RMSE) at different missing ratios.
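
To make the idea of a kernel over mixed attributes concrete, here is a minimal single-pass sketch, not the paper's iterative estimator. It assumes a product kernel: a Gaussian kernel on continuous attributes and a simple match/mismatch kernel on discrete ones, combined in a Nadaraya-Watson weighted average to impute a continuous target. The names `mixed_kernel` and `impute_continuous`, the bandwidth `h`, and the mismatch weight `lam` are all hypothetical choices made for illustration.

```python
import math


def mixed_kernel(x, z, is_discrete, h=1.0, lam=0.2):
    """Product kernel over mixed attributes (assumed form, for illustration).

    Continuous attributes contribute a Gaussian factor with bandwidth h;
    discrete attributes contribute 1 on a match and lam on a mismatch.
    """
    weight = 1.0
    for a, b, discrete in zip(x, z, is_discrete):
        if discrete:
            weight *= 1.0 if a == b else lam
        else:
            weight *= math.exp(-0.5 * ((a - b) / h) ** 2)
    return weight


def impute_continuous(query, complete_rows, targets, is_discrete, h=1.0, lam=0.2):
    """Impute a missing continuous target as a kernel-weighted mean of known targets."""
    weights = [mixed_kernel(query, row, is_discrete, h, lam) for row in complete_rows]
    total = sum(weights)
    if total == 0.0:
        # No row carries any weight: fall back to the plain mean.
        return sum(targets) / len(targets)
    return sum(w * t for w, t in zip(weights, targets)) / total


# Hypothetical data: one continuous and one discrete attribute per row.
rows = [(1.0, "a"), (1.2, "a"), (5.0, "b")]
known_targets = [10.0, 12.0, 30.0]
estimate = impute_continuous((1.1, "a"), rows, known_targets, (False, True), h=0.5)
print(round(estimate, 3))
```

An iterative variant would presumably re-run such an estimator using previously imputed values until the estimates stabilise, but that loop is not shown because the abstract gives no detail on it.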

Relevance:

100.00%

Publisher:

Abstract:

Energy consumption data are required to perform analysis, modelling, evaluation, and optimisation of energy usage in buildings. While a variety of energy consumption data sets have been examined and reported in the literature, there is a lack of a comprehensive categorisation and analysis of the available data sets. In this study, an overview of energy consumption data of buildings is provided. Three common strategies for generating energy consumption data, i.e., measurement, survey, and simulation, are described. A number of important characteristics pertaining to each strategy and the resulting data sets are discussed. In addition, a directory of energy consumption data sets of buildings is developed. The data sets are collected from either published papers or energy-related organisations. The main contributions of this study include establishing a resource pertaining to energy consumption data sets and providing information related to the characteristics and availability of the respective data sets, thereby facilitating and promoting research activities in energy consumption data analysis.

Relevance:

100.00%

Publisher:

Abstract:

Empirical phylogeographic studies have progressively sampled greater numbers of loci over time, in part motivated by theoretical papers showing that estimates of key demographic parameters improve as the number of loci increases. Recently, next-generation sequencing has been applied to questions about organismal history, with the promise of revolutionizing the field. However, no systematic assessment of how phylogeographic data sets have changed over time with respect to overall size and information content has been performed. Here, we quantify the changing nature of these genetic data sets over the past 20 years, focusing on papers published in Molecular Ecology. We found that the number of independent loci, the total number of alleles sampled and the total number of single nucleotide polymorphisms (SNPs) per data set have improved over time, with particularly dramatic increases within the past 5 years. Interestingly, uniparentally inherited organellar markers (e.g. animal mitochondrial and plant chloroplast DNA) continue to represent an important component of phylogeographic data. Single-species studies (cf. comparative studies) that focus on vertebrates (particularly fish and to some extent, birds) represent the gold standard of phylogeographic data collection. Based on the current trajectory seen in our survey data, forecast modelling indicates that the median number of SNPs per data set for studies published by the end of the year 2016 may approach ~20,000. This survey provides baseline information for understanding the evolution of phylogeographic data sets and underscores the fact that development of analytical methods for handling very large genetic data sets will be critical for facilitating growth of the field.

Relevance:

100.00%

Publisher:

Abstract:

Hundreds of terabytes of CMS (Compact Muon Solenoid) data are being accumulated for storage day by day at the University of Nebraska-Lincoln, which is one of the eight US CMS Tier-2 sites. Managing this data includes retaining useful CMS data sets and clearing storage space for newly arriving data by deleting less useful data sets. This is an important task that is currently done manually and requires a large amount of time. The overall objective of this study was to develop a methodology to help identify the data sets to be deleted when storage space is required. CMS data is stored using HDFS (Hadoop Distributed File System). HDFS logs give information regarding file access operations. Hadoop MapReduce was used to feed information in these logs to Support Vector Machines (SVMs), a machine learning algorithm applicable to classification and regression, which is used in this thesis to develop a classifier. The time taken to classify data sets with this method depends on the size of the input HDFS log file, since the Hadoop MapReduce algorithms used here have O(n) complexity. The SVM methodology produces a list of data sets for deletion along with their respective sizes. This methodology was also compared with a heuristic called Retention Cost, which was calculated from the size of the data set and the time since its last access to help decide how useful a data set is. The accuracies of both were compared by calculating the percentage of data sets predicted for deletion that were accessed at a later time. Our methodology using SVMs proved to be more accurate than the Retention Cost heuristic. This methodology could be used to solve similar problems involving other large data sets.
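
The comparison criterion described in this abstract, the percentage of data sets predicted for deletion that were nonetheless accessed later, can be sketched as follows. This is only an illustration of that metric under the assumption that data sets are identified by unique names; the function name `later_access_percentage` and the sample names are hypothetical, and neither the SVM classifier nor the Retention Cost formula (which the abstract does not give) is reproduced here.

```python
def later_access_percentage(predicted_for_deletion, later_accessed):
    """Percentage of data sets flagged for deletion that were accessed afterwards.

    A lower value means fewer still-useful data sets would have been deleted.
    Both arguments are iterables of data set identifiers (assumed unique names).
    """
    predicted = set(predicted_for_deletion)
    if not predicted:
        return 0.0
    wrongly_flagged = predicted & set(later_accessed)
    return 100.0 * len(wrongly_flagged) / len(predicted)


# Hypothetical example: four data sets flagged, one of them accessed later.
flagged = ["ds_a", "ds_b", "ds_c", "ds_d"]
accessed_afterwards = ["ds_b", "ds_x"]
print(later_access_percentage(flagged, accessed_afterwards))
```

Computing this value once for the classifier's deletion list and once for the heuristic's list gives the kind of side-by-side comparison the abstract reports.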

Relevance:

100.00%

Publisher:

Abstract:

OBJECTIVE: The aim of this study was to evaluate soft tissue image quality of a mobile cone-beam computed tomography (CBCT) scanner with an integrated flat-panel detector. STUDY DESIGN: Eight fresh human cadavers were used in this study. For evaluation of soft tissue visualization, CBCT data sets and corresponding computed tomography (CT) and magnetic resonance imaging (MRI) data sets were acquired. Evaluation was performed with the help of 10 defined cervical anatomical structures. RESULTS: The statistical analysis of the scoring results of 3 examiners revealed the CBCT images to be of inferior quality regarding the visualization of most of the predefined structures. Visualization without a significant difference was found regarding the demarcation of the vertebral bodies and the pyramidal cartilages, the arteriosclerosis of the carotids (compared with CT), and the laryngeal skeleton (compared with MRI). Regarding arteriosclerosis of the carotids compared with MRI, CBCT proved to be superior. CONCLUSIONS: The integration of a flat-panel detector improves soft tissue visualization using a mobile CBCT scanner.

Relevance:

100.00%

Publisher:

Abstract:

Nitrogen and water are essential for plant growth and development. In this study, we designed experiments to produce gene expression data of poplar roots under nitrogen starvation and water deprivation conditions. We found that a low concentration of nitrogen led first to increased root elongation, followed by lateral root proliferation and eventually increased root biomass. To identify genes regulating root growth and development under nitrogen starvation and water deprivation, we designed a series of data analysis procedures, through which we successfully identified biologically important genes. Differentially Expressed Genes (DEGs) analysis identified the genes that are differentially expressed under nitrogen starvation or drought. Protein domain enrichment analysis identified enriched themes (in same domains) that are highly interactive during the treatment. Gene Ontology (GO) enrichment analysis allowed us to identify biological processes that changed during nitrogen starvation. Based on the above analyses, we examined the local Gene Regulatory Network (GRN) and identified a number of transcription factors. Testing showed that one of them is a hierarchically high-ranked transcription factor that affects root growth under nitrogen starvation. Analyzing gene expression data is tedious and time-consuming. To avoid doing the analysis manually, we automated a computational pipeline that can now be used for identification of DEGs and protein domain analysis in a single run. It is implemented in Perl and R scripts.