948 resultados para Data structures (Computer science)
Resumo:
Large scale distributed data stores rely on optimistic replication to scale and remain highly available in the face of net work partitions. Managing data without coordination results in eventually consistent data stores that allow for concurrent data updates. These systems often use anti-entropy mechanisms (like Merkle Trees) to detect and repair divergent data versions across nodes. However, in practice hash-based data structures are too expensive for large amounts of data and create too many false conflicts. Another aspect of eventual consistency is detecting write conflicts. Logical clocks are often used to track data causality, necessary to detect causally concurrent writes on the same key. However, there is a nonnegligible metadata overhead per key, which also keeps growing with time, proportional with the node churn rate. Another challenge is deleting keys while respecting causality: while the values can be deleted, perkey metadata cannot be permanently removed without coordination. Weintroduceanewcausalitymanagementframeworkforeventuallyconsistentdatastores,thatleveragesnodelogicalclocks(BitmappedVersion Vectors) and a new key logical clock (Dotted Causal Container) to provides advantages on multiple fronts: 1) a new efficient and lightweight anti-entropy mechanism; 2) greatly reduced per-key causality metadata size; 3) accurate key deletes without permanent metadata.
Resumo:
Spatial data representation and compression has become a focus issue in computer graphics and image processing applications. Quadtrees, as one of hierarchical data structures, basing on the principle of recursive decomposition of space, always offer a compact and efficient representation of an image. For a given image, the choice of quadtree root node plays an important role in its quadtree representation and final data compression. The goal of this thesis is to present a heuristic algorithm for finding a root node of a region quadtree, which is able to reduce the number of leaf nodes when compared with the standard quadtree decomposition. The empirical results indicate that, this proposed algorithm has quadtree representation and data compression improvement when in comparison with the traditional method.
Resumo:
Abstract Big data nowadays is a fashionable topic, independently of what people mean when they use this term. But being big is just a matter of volume, although there is no clear agreement in the size threshold. On the other hand, it is easy to capture large amounts of data using a brute force approach. So the real goal should not be big data but to ask ourselves, for a given problem, what is the right data and how much of it is needed. For some problems this would imply big data, but for the majority of the problems much less data will and is needed. In this talk we explore the trade-offs involved and the main problems that come with big data using the Web as case study: scalability, redundancy, bias, noise, spam, and privacy. Speaker Biography Ricardo Baeza-Yates Ricardo Baeza-Yates is VP of Research for Yahoo Labs leading teams in United States, Europe and Latin America since 2006 and based in Sunnyvale, California, since August 2014. During this time he has lead the labs in Barcelona and Santiago de Chile. Between 2008 and 2012 he also oversaw the Haifa lab. He is also part time Professor at the Dept. of Information and Communication Technologies of the Universitat Pompeu Fabra, in Barcelona, Spain. During 2005 he was an ICREA research professor at the same university. Until 2004 he was Professor and before founder and Director of the Center for Web Research at the Dept. of Computing Science of the University of Chile (in leave of absence until today). He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989. Before he obtained two masters (M.Sc. CS & M.Eng. EE) and the electronics engineer degree from the University of Chile in Santiago. He is co-author of the best-seller Modern Information Retrieval textbook, published in 1999 by Addison-Wesley with a second enlarged edition in 2011, that won the ASIST 2012 Book of the Year award. He is also co-author of the 2nd edition of the Handbook of Algorithms and Data Structures, Addison-Wesley, 1991; and co-editor of Information Retrieval: Algorithms and Data Structures, Prentice-Hall, 1992, among more than 500 other publications. From 2002 to 2004 he was elected to the board of governors of the IEEE Computer Society and in 2012 he was elected for the ACM Council. He has received the Organization of American States award for young researchers in exact sciences (1993), the Graham Medal for innovation in computing given by the University of Waterloo to distinguished ex-alumni (2007), the CLEI Latin American distinction for contributions to CS in the region (2009), and the National Award of the Chilean Association of Engineers (2010), among other distinctions. In 2003 he was the first computer scientist to be elected to the Chilean Academy of Sciences and since 2010 is a founding member of the Chilean Academy of Engineering. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow.
Resumo:
In this session we'll explore how Microsoft uses data science and machine learning across it's entire business, from Windows and Office, to Skype and XBox. We'll look at how companies across the world use Microsoft technology for empowering their businesses in many different industries. And we'll look at data science technologies you can use yourselves, such as Azure Machine Learning and Power BI. Finally we'll discuss job opportunities for data scientists and tips on how you can be successful!
Resumo:
Tests of the new Rossby wave theories that have been developed over the past decade to account for discrepancies between theoretical wave speeds and those observed by satellite altimeters have focused primarily on the surface signature of such waves. It appears, however, that the surface signature of the waves acts only as a rather weak constraint, and that information on the vertical structure of the waves is required to better discriminate between competing theories. Due to the lack of 3-D observations, this paper uses high-resolution model data to construct realistic vertical structures of Rossby waves and compares these to structures predicted by theory. The meridional velocity of a section at 24° S in the Atlantic Ocean is pre-processed using the Radon transform to select the dominant westward signal. Normalized profiles are then constructed using three complementary methods based respectively on: (1) averaging vertical profiles of velocity, (2) diagnosing the amplitude of the Radon transform of the westward propagating signal at different depths, and (3) EOF analysis. These profiles are compared to profiles calculated using four different Rossby wave theories: standard linear theory (SLT), SLT plus mean flow, SLT plus topographic effects, and theory including mean flow and topographic effects. Our results support the classical theoretical assumption that westward propagating signals have a well-defined vertical modal structure associated with a phase speed independent of depth, in contrast with the conclusions of a recent study using the same model but for different locations in the North Atlantic. The model structures are in general surface intensified, with a sign reversal at depth in some regions, notably occurring at shallower depths in the East Atlantic. SLT provides a good fit to the model structures in the top 300 m, but grossly overestimates the sign reversal at depth. The addition of mean flow slightly improves the latter issue, but is too surface intensified. SLT plus topography rectifies the overestimation of the sign reversal, but overestimates the amplitude of the structure for much of the layer above the sign reversal. Combining the effects of mean flow and topography provided the best fit for the mean model profiles, although small errors at the surface and mid-depths are carried over from the individual effects of mean flow and topography respectively. Across the section the best fitting theory varies between SLT plus topography and topography with mean flow, with, in general, SLT plus topography performing better in the east where the sign reversal is less pronounced. None of the theories could accurately reproduce the deeper sign reversals in the west. All theories performed badly at the boundaries. The generalization of this method to other latitudes, oceans, models and baroclinic modes would provide greater insight into the variability in the ocean, while better observational data would allow verification of the model findings.
Resumo:
The design of a network is a solution to several engineering and science problems. Several network design problems are known to be NP-hard, and population-based metaheuristics like evolutionary algorithms (EAs) have been largely investigated for such problems. Such optimization methods simultaneously generate a large number of potential solutions to investigate the search space in breadth and, consequently, to avoid local optima. Obtaining a potential solution usually involves the construction and maintenance of several spanning trees, or more generally, spanning forests. To efficiently explore the search space, special data structures have been developed to provide operations that manipulate a set of spanning trees (population). For a tree with n nodes, the most efficient data structures available in the literature require time O(n) to generate a new spanning tree that modifies an existing one and to store the new solution. We propose a new data structure, called node-depth-degree representation (NDDR), and we demonstrate that using this encoding, generating a new spanning forest requires average time O(root n). Experiments with an EA based on NDDR applied to large-scale instances of the degree-constrained minimum spanning tree problem have shown that the implementation adds small constants and lower order terms to the theoretical bound.
Resumo:
Thesis (M. S.)--University of Illinois at Urbana-Champaign.
Resumo:
An implementation of Sem-ODB—a database management system based on the Semantic Binary Model is presented. A metaschema of Sem-ODB database as well as the top-level architecture of the database engine is defined. A new benchmarking technique is proposed which allows databases built on different database models to compete fairly. This technique is applied to show that Sem-ODB has excellent efficiency comparing to a relational database on a certain class of database applications. A new semantic benchmark is designed which allows evaluation of the performance of the features characteristic of semantic database applications. An application used in the benchmark represents a class of problems requiring databases with sparse data, complex inheritances and many-to-many relations. Such databases can be naturally accommodated by semantic model. A fixed predefined implementation is not enforced allowing the database designer to choose the most efficient structures available in the DBMS tested. The results of the benchmark are analyzed. ^ A new high-level querying model for semantic databases is defined. It is proven adequate to serve as an efficient native semantic database interface, and has several advantages over the existing interfaces. It is optimizable and parallelizable, supports the definition of semantic userviews and the interoperability of semantic databases with other data sources such as World Wide Web, relational, and object-oriented databases. The query is structured as a semantic database schema graph with interlinking conditionals. The query result is a mini-database, accessible in the same way as the original database. The paradigm supports and utilizes the rich semantics and inherent ergonomics of semantic databases. ^ The analysis and high-level design of a system that exploits the superiority of the Semantic Database Model to other data models in expressive power and ease of use to allow uniform access to heterogeneous data sources such as semantic databases, relational databases, web sites, ASCII files, and others via a common query interface is presented. The Sem-ODB engine is used to control all the data sources combined under a unified semantic schema. A particular application of the system to provide an ODBC interface to the WWW as a data source is discussed. ^
Resumo:
Since multimedia data, such as images and videos, are way more expressive and informative than ordinary text-based data, people find it more attractive to communicate and express with them. Additionally, with the rising popularity of social networking tools such as Facebook and Twitter, multimedia information retrieval can no longer be considered a solitary task. Rather, people constantly collaborate with one another while searching and retrieving information. But the very cause of the popularity of multimedia data, the huge and different types of information a single data object can carry, makes their management a challenging task. Multimedia data is commonly represented as multidimensional feature vectors and carry high-level semantic information. These two characteristics make them very different from traditional alpha-numeric data. Thus, to try to manage them with frameworks and rationales designed for primitive alpha-numeric data, will be inefficient. An index structure is the backbone of any database management system. It has been seen that index structures present in existing relational database management frameworks cannot handle multimedia data effectively. Thus, in this dissertation, a generalized multidimensional index structure is proposed which accommodates the atypical multidimensional representation and the semantic information carried by different multimedia data seamlessly from within one single framework. Additionally, the dissertation investigates the evolving relationships among multimedia data in a collaborative environment and how such information can help to customize the design of the proposed index structure, when it is used to manage multimedia data in a shared environment. Extensive experiments were conducted to present the usability and better performance of the proposed framework over current state-of-art approaches.
Resumo:
Bayesian methods offer a flexible and convenient probabilistic learning framework to extract interpretable knowledge from complex and structured data. Such methods can characterize dependencies among multiple levels of hidden variables and share statistical strength across heterogeneous sources. In the first part of this dissertation, we develop two dependent variational inference methods for full posterior approximation in non-conjugate Bayesian models through hierarchical mixture- and copula-based variational proposals, respectively. The proposed methods move beyond the widely used factorized approximation to the posterior and provide generic applicability to a broad class of probabilistic models with minimal model-specific derivations. In the second part of this dissertation, we design probabilistic graphical models to accommodate multimodal data, describe dynamical behaviors and account for task heterogeneity. In particular, the sparse latent factor model is able to reveal common low-dimensional structures from high-dimensional data. We demonstrate the effectiveness of the proposed statistical learning methods on both synthetic and real-world data.
Resumo:
Ecosystem engineers that increase habitat complexity are keystone species in marine systems, increasing shelter and niche availability, and therefore biodiversity. For example, kelp holdfasts form intricate structures and host the largest number of organisms in kelp ecosystems. However, methods that quantify 3D habitat complexity have only seldom been used in marine habitats, and never in kelp holdfast communities. This study investigated the role of kelp holdfasts (Laminaria hyperborea) in supporting benthic faunal biodiversity. Computer-aided tomography (CT-) scanning was used to quantify the three-dimensional geometrical complexity of holdfasts, including volume, surface area and surface fractal dimension (FD). Additionally, the number of haptera, number of haptera per unit of volume, and age of kelps were estimated. These measurements were compared to faunal biodiversity and community structure, using partial least-squares regression and multivariate ordination. Holdfast volume explained most of the variance observed in biodiversity indices, however all other complexity measures also strongly contributed to the variance observed. Multivariate ordinations further revealed that surface area and haptera per unit of volume accounted for the patterns observed in faunal community structure. Using 3D image analysis, this study makes a strong contribution to elucidate quantitative mechanisms underlying the observed relationship between biodiversity and habitat complexity. Furthermore, the potential of CT-scanning as an ecological tool is demonstrated, and a methodology for its use in future similar studies is established. Such spatially resolved imager analysis could help identify structurally complex areas as biodiversity hotspots, and may support the prioritization of areas for conservation.
Resumo:
Ecosystem engineers that increase habitat complexity are keystone species in marine systems, increasing shelter and niche availability, and therefore biodiversity. For example, kelp holdfasts form intricate structures and host the largest number of organisms in kelp ecosystems. However, methods that quantify 3D habitat complexity have only seldom been used in marine habitats, and never in kelp holdfast communities. This study investigated the role of kelp holdfasts (Laminaria hyperborea) in supporting benthic faunal biodiversity. Computer-aided tomography (CT-) scanning was used to quantify the three-dimensional geometrical complexity of holdfasts, including volume, surface area and surface fractal dimension (FD). Additionally, the number of haptera, number of haptera per unit of volume, and age of kelps were estimated. These measurements were compared to faunal biodiversity and community structure, using partial least-squares regression and multivariate ordination. Holdfast volume explained most of the variance observed in biodiversity indices, however all other complexity measures also strongly contributed to the variance observed. Multivariate ordinations further revealed that surface area and haptera per unit of volume accounted for the patterns observed in faunal community structure. Using 3D image analysis, this study makes a strong contribution to elucidate quantitative mechanisms underlying the observed relationship between biodiversity and habitat complexity. Furthermore, the potential of CT-scanning as an ecological tool is demonstrated, and a methodology for its use in future similar studies is established. Such spatially resolved imager analysis could help identify structurally complex areas as biodiversity hotspots, and may support the prioritization of areas for conservation.
Resumo:
The generation of heterogeneous big data sources with ever increasing volumes, velocities and veracities over the he last few years has inspired the data science and research community to address the challenge of extracting knowledge form big data. Such a wealth of generated data across the board can be intelligently exploited to advance our knowledge about our environment, public health, critical infrastructure and security. In recent years we have developed generic approaches to process such big data at multiple levels for advancing decision-support. It specifically concerns data processing with semantic harmonisation, low level fusion, analytics, knowledge modelling with high level fusion and reasoning. Such approaches will be introduced and presented in context of the TRIDEC project results on critical oil and gas industry drilling operations and also the ongoing large eVacuate project on critical crowd behaviour detection in confined spaces.
Resumo:
Abstract Heading into the 2020s, Physics and Astronomy are undergoing experimental revolutions that will reshape our picture of the fabric of the Universe. The Large Hadron Collider (LHC), the largest particle physics project in the world, produces 30 petabytes of data annually that need to be sifted through, analysed, and modelled. In astrophysics, the Large Synoptic Survey Telescope (LSST) will be taking a high-resolution image of the full sky every 3 days, leading to data rates of 30 terabytes per night over ten years. These experiments endeavour to answer the question why 96% of the content of the universe currently elude our physical understanding. Both the LHC and LSST share the 5-dimensional nature of their data, with position, energy and time being the fundamental axes. This talk will present an overview of the experiments and data that is gathered, and outlines the challenges in extracting information. Common strategies employed are very similar to industrial data! Science problems (e.g., data filtering, machine learning, statistical interpretation) and provide a seed for exchange of knowledge between academia and industry. Speaker Biography Professor Mark Sullivan Mark Sullivan is a Professor of Astrophysics in the Department of Physics and Astronomy. Mark completed his PhD at Cambridge, and following postdoctoral study in Durham, Toronto and Oxford, now leads a research group at Southampton studying dark energy using exploding stars called "type Ia supernovae". Mark has many years' experience of research that involves repeatedly imaging the night sky to track the arrival of transient objects, involving significant challenges in data handling, processing, classification and analysis.
Resumo:
Thesis (Ph.D.)--University of Washington, 2016-08