850 results for High-dimensional data visualization
Abstract:
M. Neal, 'An Artificial Immune System for Continuous Analysis of Time-Varying Data', in Proceedings of the 1st International Conference on Artificial Immune Systems (ICARIS), eds. J. Timmis and P. J. Bentley, volume 1, pages 76-85, 2002.
Abstract:
Timmis J and Neal M J. A resource limited artificial immune system for data analysis. In Proceedings of ES2000 - Research and Development of Intelligent Systems, pages 19-32, Cambridge, U.K., 2000. Springer.
Abstract:
R. Marti, C. Rubin, E. Denton and R. Zwiggelaar, '2D-3D correspondence in mammography', Cybernetics and Systems 35 (1), 85-105 (2004)
Abstract:
Anomalies are unusual and significant changes in a network's traffic levels, which can often involve multiple links. Diagnosing anomalies is critical for both network operators and end users. It is a difficult problem because one must extract and interpret anomalous patterns from large amounts of high-dimensional, noisy data. In this paper we propose a general method to diagnose anomalies. This method is based on a separation of the high-dimensional space occupied by a set of network traffic measurements into disjoint subspaces corresponding to normal and anomalous network conditions. We show that this separation can be performed effectively using Principal Component Analysis. Using only simple traffic measurements from links, we study volume anomalies and show that the method can: (1) accurately detect when a volume anomaly is occurring; (2) correctly identify the underlying origin-destination (OD) flow which is the source of the anomaly; and (3) accurately estimate the amount of traffic involved in the anomalous OD flow. We evaluate the method's ability to diagnose (i.e., detect, identify, and quantify) both existing and synthetically injected volume anomalies in real traffic from two backbone networks. Our method consistently diagnoses the largest volume anomalies, and does so with a very low false alarm rate.
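The subspace separation this abstract describes can be sketched in a few lines: project the traffic matrix onto its top principal components (the "normal" subspace) and score each time step by the energy left in the residual. The component count, the synthetic traffic, and the squared-prediction-error detector below are illustrative choices, not the paper's exact procedure (which uses Q-statistic thresholds on real backbone data):

```python
import numpy as np

def spe_scores(X, k):
    """Squared prediction error (SPE) per time step: the energy of each
    row's projection onto the residual subspace left after removing the
    top-k principal components of the centered measurements."""
    Xc = X - X.mean(axis=0)                    # center each link's series
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:k].T                               # basis of the "normal" subspace
    residual = Xc - Xc @ P @ P.T               # part unexplained by normal traffic
    return np.sum(residual ** 2, axis=1)

# Toy data: 8 links carrying phase-shifted periodic traffic plus noise,
# with a volume anomaly injected at time step 120.
rng = np.random.default_rng(0)
t = np.arange(200)
X = np.column_stack([np.sin(t / 10 + p) for p in np.linspace(0, 3, 8)])
X += 0.05 * rng.standard_normal(X.shape)
X[120] += 3.0                                  # the injected anomaly
spe = spe_scores(X, k=2)
print(int(np.argmax(spe)))                     # time step with the largest residual
```

Because the periodic traffic is well captured by two principal components, the injected spike dominates the residual energy and is flagged immediately.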
Abstract:
Current low-level networking abstractions on modern operating systems are commonly implemented in the kernel to provide sufficient performance for general purpose applications. However, it is desirable for high performance applications to have more control over the networking subsystem to support optimizations for their specific needs. One approach is to allow networking services to be implemented at user-level. Unfortunately, this typically incurs costs due to scheduling overheads and unnecessary data copying via the kernel. In this paper, we describe a method to implement efficient application-specific network service extensions at user level, one that removes the cost of scheduling and provides protected access to lower-level system abstractions. We present a networking implementation that, with minor modifications to the Linux kernel, passes data between "sandboxed" extensions and the Ethernet device without copying or processing in the kernel. Using this mechanism, we put a customizable networking stack into a user-level sandbox and show how it can be used to efficiently process and forward data via proxies, or intermediate hosts, in the communication path of high performance data streams. Unlike other user-level networking implementations, our method imposes no special hardware requirements to avoid unnecessary data copies. Results show that we achieve a substantial increase in throughput over comparable user-space methods using our networking stack implementation.
Abstract:
We investigate adaptive buffer management techniques for approximate evaluation of sliding window joins over multiple data streams. In many applications, data stream processing systems have limited memory or have to deal with very high speed data streams. In both cases, computing the exact results of joins between these streams may not be feasible, mainly because the buffers used to compute the joins hold a much smaller number of tuples than the sliding windows do. A stream buffer management policy is therefore needed. We show that the buffer replacement policy is an important determinant of the quality of the produced results. To that end, we propose GreedyDual-Join (GDJ), an adaptive and locality-aware buffering technique for managing these buffers. GDJ exploits the temporal correlations (at both long and short time scales) that we found to be prevalent in many real data streams. We note that our algorithm is readily applicable to multiple data streams and multiple joins and requires almost no additional system resources. We report results of an experimental study using both synthetic and real-world data sets. Our results demonstrate the superiority and flexibility of our approach when contrasted with other recently proposed techniques.
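The abstract does not spell out GDJ's mechanics, but the greedy-dual family of replacement policies it builds on can be sketched as follows. The credit values, the match-refresh rule, and the class API here are illustrative assumptions for exposition, not the paper's actual algorithm:

```python
import heapq

class GreedyDualBuffer:
    """Illustrative greedy-dual eviction for a stream-join buffer: tuples
    that recently produced matches get their credit refreshed, and the
    lowest-credit tuple is evicted when the buffer is full. (The paper's
    GDJ policy is richer; this only sketches the greedy-dual idea.)"""

    def __init__(self, capacity):
        self.capacity = capacity
        self.inflation = 0.0   # credit of the last victim; avoids rescanning
        self.credit = {}       # tuple key -> current credit
        self.heap = []         # (credit, key) min-heap; may hold stale entries

    def _set(self, key, benefit):
        self.credit[key] = self.inflation + benefit
        heapq.heappush(self.heap, (self.credit[key], key))

    def insert(self, key, benefit=1.0):
        while len(self.credit) >= self.capacity:
            c, victim = heapq.heappop(self.heap)
            if self.credit.get(victim) == c:   # ignore stale heap entries
                del self.credit[victim]
                self.inflation = c             # future credits start at this level

        self._set(key, benefit)

    def match(self, key, benefit=2.0):
        if key in self.credit:                 # a join match refreshes credit
            self._set(key, benefit)

buf = GreedyDualBuffer(capacity=2)
buf.insert("a")
buf.insert("b")
buf.match("a")              # "a" produced a match, so its credit rises
buf.insert("c")             # buffer full: the cold tuple "b" is evicted
print(sorted(buf.credit))   # ['a', 'c']
```

The inflation offset is the standard greedy-dual trick: rather than decaying every resident tuple on each eviction, new credits are raised by the last victim's credit, which has the same effect in O(log n) per operation.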
Abstract:
Computational models of learning typically train on labeled input patterns (supervised learning), unlabeled input patterns (unsupervised learning), or a combination of the two (semisupervised learning). In each case input patterns have a fixed number of features throughout training and testing. Human and machine learning contexts present additional opportunities for expanding incomplete knowledge from formal training, via self-directed learning that incorporates features not previously experienced. This article defines a new self-supervised learning paradigm to address these richer learning contexts, introducing a neural network called self-supervised ARTMAP. Self-supervised learning integrates knowledge from a teacher (labeled patterns with some features), knowledge from the environment (unlabeled patterns with more features), and knowledge from internal model activation (self-labeled patterns). Self-supervised ARTMAP learns about novel features from unlabeled patterns without destroying partial knowledge previously acquired from labeled patterns. A category selection function bases system predictions on known features, and distributed network activation scales unlabeled learning to prediction confidence. Slow distributed learning on unlabeled patterns focuses on novel features and confident predictions, defining classification boundaries that were ambiguous in the labeled patterns. Self-supervised ARTMAP improves test accuracy on illustrative low-dimensional problems and on high-dimensional benchmarks. Model code and benchmark data are available from: http://techlab.bu.edu/SSART/.
Abstract:
This article describes advances in statistical computation for large-scale data analysis in structured Bayesian mixture models via graphics processing unit (GPU) programming. The developments are partly motivated by computational challenges arising in fitting models of increasing heterogeneity to increasingly large datasets. An example context concerns common biological studies using high-throughput technologies generating many, very large datasets and requiring increasingly high-dimensional mixture models with large numbers of mixture components. We outline important strategies and processes for GPU computation in Bayesian simulation and optimization approaches, give examples of the benefits of GPU implementations in terms of processing speed and scale-up in ability to analyze large datasets, and provide a detailed, tutorial-style exposition that will benefit readers interested in developing GPU-based approaches in other statistical models. Novel, GPU-oriented approaches to modifying existing algorithms and software design can lead to vast speed-up and, critically, enable statistical analyses that presently will not be performed due to compute time limitations in traditional computational environments. Supplemental materials are provided with all source code, example data, and details that will enable readers to implement and explore the GPU approach in this mixture modeling context. © 2010 American Statistical Association, Institute of Mathematical Statistics, and Interface Foundation of North America.
Abstract:
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to other records. This flexible new representation of the linkage structure naturally allows us to estimate the attributes of the unique observable people in the population, calculate k-way posterior probabilities of matches across records, and propagate the uncertainty of record linkage into later analyses. Our linkage structure lends itself to an efficient, linear-time, hybrid Markov chain Monte Carlo algorithm, which overcomes many obstacles encountered by previously proposed methods of record linkage, despite the high dimensional parameter space. We assess our results on real and simulated data.
Abstract:
Of key importance to oil and gas companies is the size distribution of fields in the areas that they are drilling. Recent arguments suggest that there are many more fields yet to be discovered in mature provinces than had previously been thought because the underlying distribution is monotonic not peaked. According to this view the peaked nature of the distribution for discovered fields reflects not the underlying distribution but the effect of economic truncation. This paper contributes to the discussion by analysing up-to-date exploration and discovery data for two mature provinces using the discovery-process model, based on sampling without replacement and implicitly including economic truncation effects. The maximum likelihood estimation involved generates a high-dimensional mixed-integer nonlinear optimization problem. A highly efficient solution strategy is tested, exploiting the separable structure and handling the integer constraints by treating the problem as a masked allocation problem in dynamic programming.
Abstract:
Phytoplankton observation is the product of a number of trade-offs related to sampling processes, required level of diversity and size spectrum analysis capabilities of the techniques involved. Instruments combining the morphological and high-frequency analysis for phytoplankton cells are now available. This paper presents an application of the automated high-resolution flow cytometer Cytosub as a tool for analysing phytoplanktonic cells in their natural environment. High resolution data from a temporal study in the Bay of Marseille (analysis every 30 min over 1 month) and a spatial study in the Southern Indian Ocean (analysis every 5 min at 10 knots over 5 days) are presented to illustrate the capabilities and limitations of the instrument. Automated high-frequency flow cytometry revealed the spatial and temporal variability of phytoplankton in the size range 1 to ~50 μm that could not be resolved otherwise. Due to some limitations (instrumental memory, volume analysed per sample), recorded counts could be statistically too low. By combining high-frequency consecutive samples, it is possible to decrease the counting error, following Poisson's law, and to retain the main features of phytoplankton variability. With this technique, the analysis of phytoplankton variability combines adequate sampling frequency and effective monitoring of community changes.
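The counting-error point can be made concrete: under Poisson statistics the relative error of a count n is 1/sqrt(n), so pooling N consecutive samples shrinks it by a factor of 1/sqrt(N). A small sketch (the cell counts below are hypothetical, not taken from the study):

```python
import math

def poisson_relative_error(count):
    """Relative counting error of a Poisson-distributed count: the
    standard deviation of a Poisson variable with mean `count` is
    sqrt(count), so the relative error is sqrt(count)/count = 1/sqrt(count)."""
    return 1.0 / math.sqrt(count)

# A single sample recording 25 cells carries a 20 % counting error;
# pooling 16 consecutive samples (~400 cells) brings it down to 5 %.
single = poisson_relative_error(25)
pooled = poisson_relative_error(25 * 16)
print(f"{single:.0%} -> {pooled:.0%}")   # 20% -> 5%
```

This is why combining consecutive high-frequency samples recovers statistically usable counts without sacrificing the instrument's temporal resolution entirely.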
Abstract:
Physical oceanography is the study of physical conditions, processes and variables within the ocean, including temperature-salinity distributions, mixing of the water column, waves, tides, currents, and air-sea interaction processes. Here we provide a critical review of how satellite sensors are being used to study physical oceanography processes at the ocean surface and its borders with the atmosphere and sea-ice. The paper begins by describing the main sensor types that are used to observe the oceans (visible, thermal infrared and microwave) and the specific observations that each of these sensor types can provide. We then present a critical review of how these sensors and observations are being used to study i) ocean surface currents, ii) storm surges, iii) sea-ice, iv) atmosphere-ocean gas exchange and v) surface heat fluxes via phytoplankton. Exciting advances include the use of multiple sensors in synergy to observe temporally varying Arctic sea-ice volume, atmosphere-ocean gas fluxes, and the potential for 4-dimensional water circulation observations. For each of these applications we explain their relevance to society, review recent advances and capability, and provide a forward look at future prospects and opportunities. We then more generally discuss future opportunities for oceanography-focussed remote-sensing, which includes the unique European Union Copernicus programme, the potential of the International Space Station and commercial miniature satellites. The increasing availability of global satellite remote-sensing observations means that we are now entering an exciting period for oceanography. The easy access to these high quality data and the continued development of novel platforms is likely to drive further advances in remote sensing of the ocean and atmospheric systems.
Abstract:
This work analyzes the relationship between large food webs describing potential feeding relations between species and smaller sub-webs thereof describing relations actually realized in local communities of various sizes. Special attention is given to the relationships between patterns of phylogenetic correlations encountered in large webs and sub-webs. Based on the current theory of food-web topology as implemented in the matching model, it is shown that food webs are scale invariant in the following sense: given a large web described by the model, a smaller, randomly sampled sub-web thereof is described by the model as well. A stochastic analysis of model steady states reveals that such a change in scale goes along with a re-normalization of model parameters. Explicit formulae for the renormalized parameters are derived. Thus, the topology of food webs at all scales follows the same patterns, and these can be revealed by data and models referring to the local scale alone. As a by-product of the theory, a fast algorithm is derived which yields sample food webs from the exact steady state of the matching model for a high-dimensional trophic niche space in finite time. (C) 2008 Elsevier B.V. All rights reserved.
Abstract:
This work follows a publication of our group in J. Chem. Eng. Data 2007, 52, 2204-2211, presenting high temperature and pressure density data for five imidazolium-based ionic liquids. At that time, very few ionic liquid density data were available in the literature, especially at high pressure, and the uncertainty of the published results was calculated with respect to the literature data available for three of the five ionic liquids studied. Since 2007, the ionic liquid density databank has grown considerably. In this work, a comparison of our published data in J. Chem. Eng. Data 2007, 52, 2204-2211, with more than 1800 high pressure data points from the literature up to December 2011 is presented to assess the uncertainty of our published values. The claimed uncertainty is close to 0.31 % for all IL density data sets except in the case of [C1C2Im][EtSO4], where the uncertainty reaches 1.1 %. The data reported in J. Chem. Eng. Data 2007, 52, 2204-2211, for this particular ionic liquid cannot be used as a reference. For this ionic liquid, the density of the same sample batch has been remeasured using the same experimental technique, and the new experimental data presented herein are clearly higher than our previously published results. A 1H NMR analysis of the sample confirmed hydrolysis of the ethylsulfate anion to ethanol and hydrogen sulfate anion, which explains the differences observed between our density data and the literature.
Abstract:
Dynamic switching spectroscopy piezoresponse force microscopy is developed to separate thermodynamic and kinetic effects in local bias-induced phase transitions. The approaches for visualization and analysis of five-dimensional data are discussed. The spatial and voltage variability of the relaxation behavior of an a-c domain lead zirconate titanate surface suggests an interpretation in terms of surface charge dynamics. This approach is applicable to local studies of dynamic behavior in any system with reversible bias-induced phase transitions ranging from ferroelectrics and multiferroics to ionic systems such as batteries, fuel cells, and electroresistive materials. (C) 2011 American Institute of Physics. [doi:10.1063/1.3590919]