896 results for High-dimensional data visualization
                                
Abstract:
Highlights of Data Expedition: • Students explored daily observations of local climate data spanning the past 35 years. • Topological Data Analysis, or TDA for short, provides cutting-edge tools for studying the geometry of data in arbitrarily high dimensions. • Using TDA tools, students discovered intrinsic dynamical features of the data and learned how to quantify periodic phenomena in a time series. • Since nature invariably produces noisy data that rarely has exact periodicity, students also considered the theoretical basis of almost-periodicity and even invented and tested new mathematical definitions of almost-periodic functions. Summary: The dataset we used for this data expedition comes from the Global Historical Climatology Network. “GHCN (Global Historical Climatology Network)-Daily is an integrated database of daily climate summaries from land surface stations across the globe.” Source: https://www.ncdc.noaa.gov/oa/climate/ghcn-daily/ We focused on the daily maximum and minimum temperatures from January 1, 1980 to April 1, 2015 collected at RDU International Airport. Through a guided series of exercises designed to be performed in Matlab, students explore these time series, initially by direct visualization and basic statistical techniques. Then students are guided through a special sliding-window construction which transforms a time series into a high-dimensional geometric curve. These high-dimensional curves can be visualized by projecting down to lower dimensions, as in the figure below (Figure 1); however, our focus here was to use persistent homology to study the high-dimensional embedding directly. The shape of these curves carries meaningful information, but how one describes the “shape” of data depends on the scale at which the data is considered, and choosing the appropriate scale is rarely obvious. Persistent homology overcomes this obstacle by allowing us to quantitatively study geometric features of the data across multiple scales. 
Through this data expedition, students are introduced to numerically computing persistent homology using the Rips collapse algorithm and to interpreting the results. In the specific context of sliding-window constructions, 1-dimensional persistent homology can reveal the nature of periodic structure in the original data. I created a special technique to study how these high-dimensional sliding-window curves form loops in order to quantify the periodicity. Students are guided through this construction and learn how to visualize and interpret this information. Climate data is extremely complex (as anyone who has suffered from a bad weather prediction can attest), and numerous variables play a role in determining our daily weather and temperatures. This complexity, coupled with imperfections of measuring devices, results in very noisy data, so the annual seasonal periodicity is far from exact. To this end, I have students explore existing theoretical notions of almost-periodicity and test them on the data. They find that some existing definitions are inadequate in this context. Hence I challenged them to invent new mathematics by proposing and testing their own definitions. These students rose to the challenge and suggested a number of creative definitions. While autocorrelation and spectral methods based on Fourier analysis are often used to explore periodicity, the construction here provides an alternative paradigm to quantify periodic structure in almost-periodic signals using tools from topological data analysis.
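The sliding-window construction mentioned in this abstract can be illustrated with a minimal sketch. It is written in Python rather than the Matlab used in the expedition, and the function name `sliding_window_embedding`, the window dimension, the delay `tau`, and the synthetic sine signal are all assumptions made for illustration; they are not taken from the course material. The sketch only builds the point cloud and does not compute persistent homology.

```python
import math


def sliding_window_embedding(series, dim, tau=1):
    """Map each index t to the point (x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau})."""
    span = (dim - 1) * tau  # how far ahead of t the last coordinate reaches
    return [
        tuple(series[t + k * tau] for k in range(dim))
        for t in range(len(series) - span)
    ]


# Hypothetical periodic signal standing in for a temperature series.
period = 20
signal = [math.sin(2 * math.pi * t / period) for t in range(100)]

# Each window of 5 samples (spaced 2 apart) becomes one point in R^5.
cloud = sliding_window_embedding(signal, dim=5, tau=2)
```

For an exactly periodic signal the embedded curve closes up on itself after one period, which is the kind of loop that 1-dimensional persistent homology is described as detecting; noisy, almost-periodic data would only return close to earlier points.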
                                
Abstract:
Thesis (Ph.D.)--University of Washington, 2016-08
                                
Abstract:
Internet users consume online targeted advertising based on information collected about them and voluntarily share personal information in social networks. Sensor information and data from smartphones are collected and used by applications, sometimes in unclear ways. As happens today with smartphones, in the near future sensors will be shipped in all types of connected devices, allowing ubiquitous information gathering from the physical environment and enabling the vision of Ambient Intelligence. The value of gathered data, if not obvious, can be harnessed through data mining techniques and put to use by enabling personalized and tailored services as well as business intelligence practices, fueling the digital economy. However, the ever-expanding gathering and use of information undermines the privacy conceptions of the past. Natural social practices of managing privacy in daily relations are overridden by socially awkward communication tools, service providers struggle with security issues resulting in harmful data leaks, governments use mass surveillance techniques, the incentives of the digital economy threaten consumer privacy, and the advancement of consumer-grade data-gathering technology enables new inter-personal abuses. A wide range of fields attempts to address technology-related privacy problems; however, they vary immensely in terms of assumptions, scope and approach. Privacy of future use cases is typically handled vertically, instead of building upon previous work that can be re-contextualized, while current privacy problems are typically addressed per type in a more focused way. Because significant effort was required to make sense of the relations and structure of privacy-related work, this thesis attempts to transmit a structured view of it. It is multi-disciplinary, ranging from cryptography to economics and including distributed systems and information theory, and addresses privacy issues of different natures. 
As existing work is framed and discussed, the contributions to the state of the art made in the scope of this thesis are presented. The contributions add to five distinct areas: 1) identity in distributed systems; 2) future context-aware services; 3) event-based context management; 4) low-latency information flow control; 5) high-dimensional dataset anonymity. Finally, having laid out such a landscape of privacy-preserving work, the current and future privacy challenges are discussed, considering not only technical but also socio-economic perspectives.
                                
Abstract:
Although large-scale public hypermedia structures such as the World Wide Web are popularly referred to as "cyberspace", the extent to which they constitute a space in the everyday sense of the word is questionable. This paper reviews recent work in the area of three dimensional (3D) visualization of the Web that has attempted to depict it in the form of a recognizable space; in other words, as a navigable landscape that may be visibly populated by its users. Our review begins by introducing a range of visualizations that address different aspects of using the Web. These include visualizations of Web structure, especially of links, that act as 3D maps; browsing history; searches; evolution of the Web; and the presence and activities of multiple users. We then summarize the different techniques that are employed by these visualizations. We conclude with a discussion of key challenges for the future.
                                
Abstract:
Personal information is increasingly gathered and used for providing services tailored to user preferences, but the datasets used to provide such functionality can represent serious privacy threats if not appropriately protected. Work in privacy-preserving data publishing has targeted privacy guarantees that protect against record re-identification, by making records indistinguishable, or against sensitive attribute value disclosure, by introducing diversity or noise in the sensitive values. However, most approaches fail in the high-dimensional case, and those that do not fail introduce a utility cost incompatible with tailored recommendation scenarios. This paper aims at a sensible trade-off between privacy and the benefits of tailored recommendations, in the context of privacy-preserving data publishing. We empirically demonstrate that significant privacy improvements can be achieved at a utility cost compatible with tailored recommendation scenarios, using a simple partition-based sanitization method.
                                
Abstract:
Sequences of timestamped events are currently being generated across nearly every domain of data analytics, from e-commerce web logging to electronic health records used by doctors and medical researchers. Every day, this data type is reviewed by humans who apply statistical tests, hoping to learn everything they can about how these processes work, why they break, and how they can be improved upon. To further uncover how these processes work the way they do, researchers often compare two groups, or cohorts, of event sequences to find the differences and similarities between outcomes and processes. With temporal event sequence data, this task is complex because of the variety of ways single events and sequences of events can differ between the two cohorts of records: the structure of the event sequences (e.g., event order, co-occurring events, or frequencies of events), the attributes about the events and records (e.g., gender of a patient), or metrics about the timestamps themselves (e.g., duration of an event). Running statistical tests to cover all these cases and determining which results are significant becomes cumbersome. Current visual analytics tools for comparing groups of event sequences emphasize a purely statistical or purely visual approach for comparison. Visual analytics tools leverage humans' ability to easily see patterns and anomalies that they were not expecting, but are limited by uncertainty in findings. Statistical tools emphasize finding significant differences in the data, but often require researchers to have a concrete question and do not facilitate more general exploration of the data. Combining visual analytics tools with statistical methods leverages the benefits of both approaches for quicker and easier insight discovery. 
Integrating statistics into a visualization tool presents many challenges on the frontend (e.g., displaying the results of many different metrics concisely) and in the backend (e.g., scalability challenges with running various metrics on multi-dimensional data at once). I begin by exploring the problem of comparing cohorts of event sequences and understanding the questions that analysts commonly ask in this task. From there, I demonstrate that combining automated statistics with an interactive user interface amplifies the benefits of both types of tools, thereby enabling analysts to conduct quicker and easier data exploration, hypothesis generation, and insight discovery. The direct contributions of this dissertation are: (1) a taxonomy of metrics for comparing cohorts of temporal event sequences, (2) a statistical framework for exploratory data analysis with a method I refer to as high-volume hypothesis testing (HVHT), (3) a family of visualizations and guidelines for interaction techniques that are useful for understanding and parsing the results, and (4) a user study, five long-term case studies, and five short-term case studies which demonstrate the utility and impact of these methods in various domains: four in the medical domain, one in web log analysis, two in education, and one each in social networks, sports analytics, and security. My dissertation contributes an understanding of how cohorts of temporal event sequences are commonly compared and the difficulties associated with applying and parsing the results of these metrics. It also contributes a set of visualizations, algorithms, and design guidelines for balancing automated statistics with user-driven analysis to guide users to significant, distinguishing features between cohorts. 
This work opens avenues for future research in comparing two or more groups of temporal event sequences, opening traditional machine learning and data mining techniques to user interaction, and extending the principles found in this dissertation to data types beyond temporal event sequences.
                                
Abstract:
The latest generation of computers provides performance high enough to build computationally expensive computer vision applications for mobile robotics. Building a map of the environment is a common task for a robot and is essential for allowing robots to move through these environments. Traditionally, mobile robots have used a combination of several sensors from different technologies. Lasers, sonars and contact sensors have typically been used in mobile robotic architectures; however, color cameras are an important sensor because we want robots to use the same information that humans use to sense and move through different environments. Color cameras are cheap and flexible, but a lot of work needs to be done to give robots enough visual understanding of the scenes. Computer vision algorithms are computationally complex, but nowadays robots have access to different and powerful architectures that can be used for mobile robotics purposes. The advent of low-cost RGB-D sensors like Microsoft Kinect, which provide 3D colored point clouds at high frame rates, has made computer vision even more relevant in the mobile robotics field. The combination of visual and 3D data allows systems to use both computer vision and 3D processing and therefore to be aware of more details of the surrounding environment. The research described in this thesis was motivated by the need for scene mapping. Being aware of the surrounding environment is a key feature in many mobile robotics applications, from simple robotic navigation to complex surveillance applications. In addition, the acquisition of a 3D model of a scene is useful in many areas, such as video game scene modeling, where well-known places are reconstructed and added to game systems, or advertising, where once the 3D model of a room is obtained the system can add furniture pieces using augmented reality techniques. 
In this thesis we perform an experimental study of state-of-the-art registration methods to find which one best fits our scene mapping purposes. Different methods are tested and analyzed on scenes with different distributions of visual and geometric appearance. In addition, this thesis proposes two methods for 3D data compression and representation of 3D maps. Our 3D representation proposal is based on the Growing Neural Gas (GNG) method. This type of Self-Organizing Map (SOM) has been successfully used for clustering, pattern recognition and topology representation of various kinds of data. Until now, Self-Organizing Maps have been primarily computed offline, and their application to 3D data has mainly focused on noise-free models without considering time constraints. Self-organizing neural models have the ability to provide a good representation of the input space. In particular, the Growing Neural Gas (GNG) is a suitable model because of its flexibility, rapid adaptation and excellent quality of representation. However, this type of learning is time-consuming, especially for high-dimensional input data. Since real applications often work under time constraints, it is necessary to adapt the learning process in order to complete it in a predefined time. This thesis proposes a hardware implementation leveraging the computing power of modern GPUs, which takes advantage of a new paradigm known as General-Purpose Computing on Graphics Processing Units (GPGPU). Our proposed geometric 3D compression method seeks to reduce the 3D information using plane detection as the basic structure for compressing the data. This is because our target environments are man-made, and therefore many points belong to a planar surface. Our proposed method is able to obtain good compression results in those man-made scenarios. The detected and compressed planes can also be used in other applications such as surface reconstruction or plane-based registration algorithms. 
Finally, we have also demonstrated the benefits of GPU technologies by obtaining a high-performance implementation of a common CAD/CAM technique called Virtual Digitizing.
                                
Abstract:
Event extraction from texts aims to detect structured information such as what has happened, to whom, where and when. Event extraction and visualization are typically considered as two different tasks. In this paper, we propose a novel approach based on probabilistic modelling to jointly extract and visualize events from tweets where both tasks benefit from each other. We model each event as a joint distribution over named entities, a date, a location and event-related keywords. Moreover, both tweets and event instances are associated with coordinates in the visualization space. The manifold assumption that the intrinsic geometry of tweets is a low-rank, non-linear manifold within the high-dimensional space is incorporated into the learning framework using a regularization. Experimental results show that the proposed approach can effectively deal with both event extraction and visualization and performs remarkably better than both the state-of-the-art event extraction method and a pipeline approach for event extraction and visualization.
                                
Abstract:
This thesis is concerned with change point analysis for time series, i.e. with detection of structural breaks in time-ordered, random data. This long-standing research field has regained popularity over the last few years and, like statistical analysis in general, is still undergoing a transformation toward high-dimensional problems. We focus on the fundamental »change in the mean« problem and provide extensions of the classical non-parametric Darling-Erdős-type cumulative sum (CUSUM) testing and estimation theory within high-dimensional Hilbert space settings. In the first part we contribute to (long run) principal component based testing methods for Hilbert space valued time series under a rather broad (abrupt, epidemic, gradual, multiple) change setting and under dependence. For the dependence structure we consider either traditional m-dependence assumptions or more recently developed m-approximability conditions which cover, e.g., MA, AR and ARCH models. We derive Gumbel and Brownian bridge type approximations of the distribution of the test statistic under the null hypothesis of no change and consistency conditions under the alternative. A new formulation of the test statistic using projections on subspaces allows us to simplify the standard proof techniques and to weaken common assumptions on the covariance structure. Furthermore, we propose to adjust the principal components by an implicit estimation of a (possible) change direction. This approach adds flexibility to projection based methods, weakens typical technical conditions and provides better consistency properties under the alternative. In the second part we contribute to estimation methods for common changes in the means of panels of Hilbert space valued time series. We analyze weighted CUSUM estimates within a recently proposed »high-dimensional low sample size (HDLSS)« framework, where the sample size is fixed but the number of panels increases. 
We derive sharp conditions on »pointwise asymptotic accuracy« or »uniform asymptotic accuracy« of those estimates in terms of the weighting function. Particularly, we prove that a covariance-based correction of Darling-Erdős-type CUSUM estimates is required to guarantee uniform asymptotic accuracy under moderate dependence conditions within panels and that these conditions are fulfilled, e.g., by any MA(1) time series. As a counterexample we show that for AR(1) time series, close to the non-stationary case, the dependence is too strong and uniform asymptotic accuracy cannot be ensured. Finally, we conduct simulations to demonstrate that our results are practically applicable and that our methodological suggestions are advantageous.
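For readers unfamiliar with the »change in the mean« problem, the basic idea behind CUSUM estimation can be sketched for a single scalar series. This is a textbook, unweighted version written for illustration only; it is not the weighted, Hilbert space valued panel estimator analyzed in the thesis, and the function name `cusum_change_point` and the toy data are assumptions.

```python
def cusum_change_point(xs):
    """Estimate a single change in the mean of a scalar series.

    Returns k, the number of observations before the estimated change,
    chosen to maximize |sum_{i<=k} (x_i - overall mean)|.
    """
    n = len(xs)
    mean = sum(xs) / n
    best_k, best_val, partial = 0, -1.0, 0.0
    for k in range(1, n):
        partial += xs[k - 1] - mean  # centered partial sum S_k
        if abs(partial) > best_val:
            best_k, best_val = k, abs(partial)
    return best_k


# Toy series: the mean jumps from 0 to 1 after the 30th observation.
toy_series = [0.0] * 30 + [1.0] * 20
estimated_k = cusum_change_point(toy_series)
```

The centered partial sums drift away from zero while the series sits below (or above) its overall mean and turn back once the mean shifts, so the location of the largest absolute partial sum is a natural estimate of the change point. Weighted variants, as studied in the abstract, rescale these partial sums before maximizing.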
                                
Abstract:
The use of wood from reforested species by the furniture industry is a recent trend. Thus, the present study determined the specific gravity and shrinkage of wood of 18-year-old Eucalyptus grandis, Eucalyptus dunnii and Eucalyptus urophylla, for use as components in solid wood furniture making. The tests to evaluate the radial and axial variation of specific gravity and shrinkage in the eucalyptus trees were performed according to NBR 7190/96. The results of the analysis of wood from the eucalypt species were subjected to the Homogeneity Test, ANOVA, Tukey and Pearson correlation and compared to the performance of sucupira wood (Bowdichia nitida) and cumaru wood (Dipteryx odorata), often used in the furniture industry. The following results were found: Eucalyptus grandis had a lower value of shrinkage, being more suitable for furniture components that require high dimensional stability, as well as for parts with a larger surface. The wood of this species showed a rate of dimensional variation compatible with the native species used in the furniture industry. The radial variation of the wood was also verified, and a high correlation between specific gravity and shrinkage was found. Longitudinally, the base of the trunk of the eucalyptus trees was shown to be the region of greatest dimensional stability.
                                
Abstract:
Matrix population models, elasticity analysis and loop analysis can potentially provide powerful techniques for the analysis of life histories. Data from a capture-recapture study on a population of southern highland water skinks (Eulamprus tympanum) were used to construct a matrix population model. Errors in elasticities were calculated by using the parametric bootstrap technique. Elasticity and loop analyses were then conducted to identify the life history stages most important to fitness. The same techniques were used to investigate the relative importance of fast versus slow growth, and rapid versus delayed reproduction. Mature water skinks were long-lived, but there was high immature mortality. The most sensitive life history stage was the subadult stage. It is suggested that life history evolution in E. tympanum may be strongly affected by predation, particularly by birds. Because our population declined over the study, slow growth and delayed reproduction were the optimal life history strategies over this period. Although the techniques of evolutionary demography provide a powerful approach for the analysis of life histories, there are formidable logistical obstacles in gathering enough high-quality data for robust estimates of the critical parameters.
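The elasticity analysis named in this abstract can be sketched in a few lines. The elasticity of the population growth rate λ with respect to a matrix entry a_ij is (a_ij / λ) · v_i w_j / ⟨v, w⟩, where w is the stable stage distribution (right eigenvector) and v the reproductive value (left eigenvector). The three-stage matrix below uses made-up values, not estimates from the skink study, and the helper names are hypothetical; power iteration is used only to keep the sketch dependency-free, and loop analysis is not covered.

```python
def dominant_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and eigenvector scaled to sum to 1."""
    n = len(matrix)
    vec = [1.0 / n] * n
    lam = 0.0
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        lam = sum(nxt)  # vec sums to 1, so this is the current growth factor
        vec = [x / lam for x in nxt]
    return lam, vec


def elasticities(matrix):
    """Elasticity of the dominant eigenvalue with respect to each entry."""
    n = len(matrix)
    transpose = [[matrix[j][i] for j in range(n)] for i in range(n)]
    lam, stable_stage = dominant_eigenpair(matrix)   # right eigenvector w
    _, repro_value = dominant_eigenpair(transpose)   # left eigenvector v
    scale = lam * sum(v * w for v, w in zip(repro_value, stable_stage))
    return [
        [matrix[i][j] * repro_value[i] * stable_stage[j] / scale for j in range(n)]
        for i in range(n)
    ]


# Hypothetical stage matrix (juvenile, subadult, adult); values are illustrative.
stage_matrix = [
    [0.0, 0.5, 2.0],   # fecundities
    [0.3, 0.4, 0.0],   # juvenile -> subadult, subadult stasis
    [0.0, 0.3, 0.8],   # subadult -> adult, adult survival
]
elasticity_matrix = elasticities(stage_matrix)
```

Elasticities are proportional sensitivities and sum to one over the whole matrix, which is what makes them convenient for comparing the relative contribution of different life history stages.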
                                
Abstract:
Purpose: The aim of this research was to assess the dimensional accuracy of orbital prostheses based on reversed images generated by computer-aided design/computer-assisted manufacturing (CAD/CAM) using computed tomography (CT) scans. Materials and Methods: CT scans of the faces of 15 adults, men and women older than 25 years of age not bearing any congenital or acquired craniofacial defects, were processed using CAD software to produce 30 reversed three-dimensional models of the orbital region. These models were then processed using the CAM system by means of selective laser sintering to generate surface prototypes of the volunteers' orbital regions. Two moulage impressions of the faces of each volunteer were taken to manufacture 15 pairs of casts. Orbital defects were created on the right or left side of each cast. The surface prototypes were adapted to the casts and then flasked to fabricate silicone prostheses. The establishment of anthropometric landmarks on the orbital region and facial midline allowed for the data collection of 31 linear measurements, used to assess the dimensional accuracy of the orbital prostheses and their location on the face. Results: The comparative analyses of the linear measurements taken from the orbital prostheses and the opposite sides that originated the surface prototypes demonstrated that the orbital prostheses presented similar vertical, transverse, and oblique dimensions, as well as similar depth. There was no transverse or oblique displacement of the prostheses. Conclusion: From a clinical perspective, the small differences observed after analyzing all 31 linear measurements did not indicate facial asymmetry. The dimensional accuracy of the orbital prostheses suggested that the CAD/CAM system assessed herein may be applicable for clinical purposes. Int J Prosthodont 2010;23:271-276.
                                
Abstract:
Phenylethanolamine N-methyltransferase, PNMT, utilizes the methylating cofactor S-adenosyl-L-methionine to catalyse the synthesis of adrenaline. Human PNMT has been crystallized in complex with an inhibitor and the cofactor product S-adenosyl-L-homocysteine using the hanging-drop technique with PEG 6000 and lithium chloride as precipitant. A critical requirement for crystallization was a high enzyme concentration (>90 mg ml⁻¹) and cryocrystallography was used for high-quality data measurement. Diffraction data measured from a cryocooled crystal extend to a resolution of 2.3 Å. Cryocooled crystals belong to space group P4₃2₁2 and have unit-cell parameters a = b = 94.3, c = 187.7 Å.
                                
Abstract:
This paper addresses robust model-order reduction of a high-dimensional nonlinear partial differential equation (PDE) model of a complex biological process. Based on a nonlinear, distributed parameter model of the same process, which was validated against experimental data of an existing, pilot-scale BNR activated sludge plant, we developed a state-space model with 154 state variables in this work. A general algorithm for robustly reducing the nonlinear PDE model is presented. Based on an investigation of five state-of-the-art model-order reduction techniques, we are able to reduce the original model to a model with only 30 states without incurring pronounced modelling errors. The singular perturbation approximation balanced truncation technique is found to give the lowest modelling errors in low frequency ranges and hence is deemed most suitable for controller design and other real-time applications. © 2002 Elsevier Science Ltd. All rights reserved.
                                
Abstract:
Many learning problems require handling high-dimensional datasets with a relatively small number of instances. Learning algorithms are thus confronted with the curse of dimensionality, and need to address it in order to be effective. Examples of these types of data include the bag-of-words representation in text classification problems and gene expression data for tumor detection/classification. Usually, among the high number of features characterizing the instances, many may be irrelevant (or even detrimental) for the learning tasks. It is thus clear that there is a need for adequate techniques for feature representation, reduction, and selection, to improve both the classification accuracy and the memory requirements. In this paper, we propose combined unsupervised feature discretization and feature selection techniques, suitable for medium- and high-dimensional datasets. The experimental results on several standard datasets, with both sparse and dense features, show the efficiency of the proposed techniques as well as improvements over previous related techniques.
 
                    