64 results for mining data streams
Abstract:
Sea surface temperature (SST) measurements are required by operational ocean and atmospheric forecasting systems to constrain modeled upper ocean circulation and thermal structure. The Global Ocean Data Assimilation Experiment (GODAE) High Resolution SST Pilot Project (GHRSST-PP) was initiated to address these needs by coordinating the provision of accurate, high-resolution SST products for the global domain. The pilot project is now complete, but activities continue within the Group for High Resolution SST (GHRSST). The pilot project focused on harmonizing diverse satellite and in situ data streams that were indexed, processed, quality controlled, analyzed, and documented within a Regional/Global Task Sharing (R/GTS) framework implemented in an internationally distributed manner. Data with meaningful error estimates, developed within GHRSST, are provided by services within R/GTS. Currently, several terabytes of data are processed at international centers daily, creating more than 25 gigabytes of product. Ensemble SST analyses together with anomaly SST outputs are generated each day, providing confidence in SST analyses via diagnostic outputs. Diagnostic data sets are generated and Web interfaces are provided to monitor the quality of observation and analysis products. GHRSST research and development projects continue to tackle problems of instrument calibration, algorithm development, diurnal variability, skin temperature deviation, and validation/verification of GHRSST products. GHRSST also works closely with applications and users, providing a forum for regular discussion and feedback between SST users and producers. All data within the GHRSST R/GTS framework are freely available. This paper reviews the progress of GHRSST-PP, highlighting achievements that have been fundamental to the success of the pilot project.
Abstract:
In this article, we investigate how the choice of the attenuation factor in an extended version of Katz centrality influences the centrality of the nodes in evolving communication networks. For given snapshots of a network, observed over a period of time, recently developed communicability indices aim to identify the best broadcasters and listeners (receivers) in the network. Here we explore the attenuation factor constraint, in relation to the spectral radius (the largest eigenvalue) of the network at any point in time, and its computation in the case of large networks. We compare three different communicability measures: standard, exponential, and relaxed (where the spectral radius bound on the attenuation factor is relaxed and the adjacency matrix is normalised, in order to maintain the convergence of the measure). Furthermore, using a vitality-based measure of both standard and relaxed communicability indices, we look at ways of identifying the individuals most important for broadcasting and receiving messages in community-bridging roles. We compare those measures with the scores produced by an iterative version of the PageRank algorithm and illustrate our findings with three examples of real-life evolving networks: the MIT reality mining data set, consisting of daily communications between 106 individuals over the period of one year; a UK Twitter mentions network, constructed from the direct tweets between 12.4k individuals during one week; and a subset of the Enron email data set.
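As a minimal illustration of the attenuation-factor constraint discussed above (a sketch, not the authors' implementation), the Katz-style resolvent converges only when the attenuation factor is kept below the reciprocal of the spectral radius; the toy adjacency matrix below is invented for the example.

```python
import numpy as np

# Toy undirected network; the matrix is invented for illustration.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

rho = max(abs(np.linalg.eigvals(A)))   # spectral radius (largest eigenvalue in modulus)
alpha = 0.5 / rho                      # attenuation factor, safely below 1/rho

# Katz-style centrality: row sums of (I - alpha*A)^{-1}, i.e. walks of all
# lengths counted with weight alpha^length; the series diverges if alpha >= 1/rho.
katz = np.linalg.solve(np.eye(len(A)) - alpha * A, np.ones(len(A)))
print(katz)  # higher score = better-connected node in this snapshot
```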
Abstract:
Body Sensor Networks (BSNs) have recently been introduced for the remote monitoring of human activities in a broad range of application domains, such as health care, emergency management, fitness and behaviour surveillance. BSNs can be deployed in a community of people and can generate large amounts of contextual data that require a scalable approach for storage, processing and analysis. Cloud computing can provide a flexible storage and processing infrastructure to perform both online and offline analysis of data streams generated in BSNs. This paper proposes BodyCloud, a SaaS approach for community BSNs that supports the development and deployment of Cloud-assisted BSN applications. BodyCloud is a multi-tier application-level architecture that integrates a Cloud computing platform with BSN data-stream middleware. BodyCloud provides programming abstractions that allow the rapid development of community BSN applications. This work describes the general architecture of the proposed approach and presents a case study for the real-time monitoring and analysis of cardiac data streams of many individuals.
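A minimal sketch of the multi-tier idea (entirely hypothetical; it does not reflect BodyCloud's actual API or middleware): a sensor-side tier streams readings and a cloud-side tier consumes and analyzes them online, with an in-process queue standing in for the real data-stream middleware.

```python
import queue
import random
import threading
import time

stream = queue.Queue()  # stands in for the BSN data-stream middleware

def bsn_node(samples: int) -> None:
    """Sensor tier: emit simulated heart-rate readings into the stream."""
    for _ in range(samples):
        stream.put(70 + random.gauss(0, 8))
        time.sleep(0.01)
    stream.put(None)  # sentinel: end of stream

def cloud_analyzer() -> None:
    """Cloud tier: consume the stream and flag anomalous readings online."""
    while (reading := stream.get()) is not None:
        if abs(reading - 70) > 16:  # toy two-sigma anomaly rule
            print(f"alert: abnormal reading {reading:.1f} bpm")

threading.Thread(target=bsn_node, args=(200,)).start()
cloud_analyzer()
```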
Abstract:
Body area networks (BANs) are emerging as an enabling technology for many human-centered application domains such as health care, sport, fitness, wellness, ergonomics, emergency, safety, security, and sociality. A BAN, which basically consists of wireless wearable sensor nodes usually coordinated by a static or mobile device, is mainly used to monitor a single assisted person. Data generated by a BAN can be processed in real time by the BAN coordinator and/or transmitted to a server side for online/offline processing and long-term storage. A network of BANs worn by a community of people produces large amounts of contextual data that require a scalable and efficient approach for processing and storage. Cloud computing can provide a flexible storage and processing infrastructure to perform both online and offline analysis of body sensor data streams. In this paper, we motivate the introduction of Cloud-assisted BANs along with the main challenges that need to be addressed for their development and management. The current state of the art is overviewed and framed according to the main requirements for effective Cloud-assisted BAN architectures. Finally, relevant open research issues in terms of efficiency, scalability, security, interoperability, prototyping, and dynamic deployment and management are discussed.
Abstract:
Our digital universe is rapidly expanding; more and more daily activities are digitally recorded, data arrive in streams that need to be analyzed in real time, and the underlying processes may evolve over time. In the last decade many adaptive learning algorithms and prediction systems, which can automatically update themselves with new incoming data, have been developed. The majority of those algorithms focus on improving predictive performance and assume that a model update is always desired, as soon as possible and as frequently as possible. In this study we treat a potential model update as an investment decision which, as in the financial markets, should be taken only if a certain return on investment is expected. We introduce and motivate a new research problem for data streams: cost-sensitive adaptation. We propose a reference framework for analyzing adaptation strategies in terms of costs and benefits. Our framework makes it possible to characterize and decompose the costs of model updates, and to assess and interpret the gains in performance due to model adaptation for a given learning algorithm on a given prediction task. Our proof-of-concept experiment demonstrates how the framework can aid in analyzing and managing adaptation decisions in the chemical industry.
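A toy sketch of the investment view of adaptation (the cost figures and threshold are invented for illustration and are not taken from the paper): trigger a model update only when the expected performance gain is worth its cost.

```python
# All figures below are invented for illustration, not from the paper.
def should_adapt(expected_error_drop: float,
                 value_per_error_point: float,
                 update_cost: float,
                 required_roi: float = 1.0) -> bool:
    """Treat a model update as an investment: adapt only if the expected
    benefit clears the required return on the update cost."""
    expected_benefit = expected_error_drop * value_per_error_point
    return expected_benefit >= required_roi * update_cost

# A retrain costing 50 units that should cut error by 2 points, each point
# being worth 30 units, clears an ROI threshold of 1 (benefit 60 >= cost 50).
print(should_adapt(expected_error_drop=2.0,
                   value_per_error_point=30.0,
                   update_cost=50.0))  # True
```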
Abstract:
We present a general Multi-Agent System framework for distributed data mining based on a Peer-to-Peer model. Agent protocols are implemented through message-based asynchronous communication. The framework adopts a dynamic load-balancing policy that is particularly suitable for irregular search algorithms. A modular design separates the general-purpose system protocols and software components from the specific data mining algorithm. The experimental evaluation, carried out on a parallel frequent subgraph mining algorithm, has shown good scalability.
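A minimal shared-memory sketch of a dynamic load-balancing policy for irregular workloads (hypothetical; the framework itself exchanges work through asynchronous messages between peer agents rather than a shared structure):

```python
import collections
import random

# One work queue per agent, deliberately uneven to mimic an irregular search.
queues = [collections.deque(f"task-{i}-{j}" for j in range(random.randint(0, 6)))
          for i in range(4)]

def next_task(agent: int):
    """Take local work first; if the local queue is empty, steal from the
    most loaded peer, which rebalances irregular workloads on the fly."""
    if queues[agent]:
        return queues[agent].popleft()
    victim = max(range(len(queues)), key=lambda a: len(queues[a]))
    return queues[victim].pop() if queues[victim] else None

while any(queues):
    for agent in range(4):
        task = next_task(agent)
        if task is not None:
            print(f"agent {agent} runs {task}")
```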
Abstract:
A wireless sensor network (WSN) is a group of sensors linked by a wireless medium to perform distributed sensing tasks. WSNs have attracted wide interest from academia and industry alike due to their diversity of applications, including home automation, smart environments, and emergency services in various buildings. The primary goal of a WSN is to collect the data sensed by its sensors. These data are characteristically noisy and exhibit temporal and spatial correlation. As this paper will demonstrate, extracting useful information from such data requires a variety of analysis techniques. Data mining is a process in which a wide spectrum of data analysis methods is used; it is applied in this paper to analyse data collected from WSNs monitoring an indoor environment in a building. A case study is given to demonstrate how data mining can be used to optimise the use of office space in a building.
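As a small, generic illustration of one such analysis step (not necessarily a method from the paper), temporal correlation in noisy readings can be exploited with a simple moving average before any mining is applied:

```python
import random

# Simulated sensor trace: a slow temperature drift plus measurement noise.
true_temp = [20 + 0.1 * t for t in range(50)]
readings = [x + random.gauss(0, 1.0) for x in true_temp]

def moving_average(xs, window=5):
    """Smooth each reading with its recent predecessors, exploiting the
    temporal correlation of the underlying signal to suppress noise."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

smoothed = moving_average(readings)
print(f"raw last reading: {readings[-1]:.2f}, smoothed: {smoothed[-1]:.2f}")
```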
Abstract:
Knowledge elicitation is a common technique used to produce rules about the operation of a plant from the knowledge available from human expertise. Similarly, data mining is becoming a popular technique for extracting rules from the data available from the operation of a plant. In the work reported here, knowledge was required to enable the supervisory control of an aluminium hot strip mill through the determination of mill set-points. A method was developed to fuse knowledge elicitation and data mining so as to incorporate the best aspects of each technique whilst avoiding known problems. The knowledge was utilised through an expert system, which determined schedules of set-points and provided information to human operators. The results show that the method proposed in this paper was effective in producing rules for the on-line control of a complex industrial process.
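A schematic sketch of how elicited and mined rules might drive set-point selection (entirely hypothetical conditions and values; the abstract does not give the actual rule base):

```python
# Entirely hypothetical rules and values; the paper's rule base is not
# given in the abstract. Each rule maps strip conditions to a set-point.
RULES = [
    (lambda s: s["gauge_mm"] < 2.0 and s["width_mm"] <= 1200, 1.8),
    (lambda s: s["gauge_mm"] < 2.0 and s["width_mm"] > 1200, 2.1),
    (lambda s: s["gauge_mm"] >= 2.0, 2.6),
]

def set_point(strip: dict) -> float:
    """Return the first matching rule's set-point, mimicking how an expert
    system might schedule set-points and report them to operators."""
    for condition, sp in RULES:
        if condition(strip):
            return sp
    raise ValueError("no rule covers this strip")

print(set_point({"gauge_mm": 1.6, "width_mm": 1300}))  # -> 2.1
```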
Abstract:
This is a report on the data-mining of two chess databases, the objective being to compare their sub-7-man content with perfect play as documented in Nalimov endgame tables. Van der Heijden’s ENDGAME STUDY DATABASE IV is a definitive collection of 76,132 studies in which White should have an essentially unique route to the stipulated goal. Chessbase’s BIG DATABASE 2010 holds some 4.5 million games. Insight gained into both database content and data-mining has led to some delightful surprises and created a further agenda.
Abstract:
Aircraft Maintenance, Repair and Overhaul (MRO) agencies rely largely on raw-data-based quotation systems to select the best suppliers for their customers (airlines). Data quantity and quality become a key issue in determining the success of an MRO job, since cost and quality benchmarks must be met. This paper introduces a data mining approach to create an MRO quotation system that enhances data quantity and quality, and enables significantly more precise MRO job quotations. Regular expressions were used to analyse descriptive textual feedback (i.e. engineers' reports) in order to extract more referable, highly normalised data for job quotation. A text-mining-based key-influencer analysis function enables the user to proactively select sub-parts, defects and possible solutions to make queries more accurate. Implementation results show that the system would improve cost quotation in 40% of MRO jobs and would reduce service cost without causing a drop in service quality.
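A small sketch of the regular-expression idea (the report format, field names, and pattern are invented for illustration; real engineers' reports would need more elaborate patterns):

```python
import re

# Invented report format and fields; real reports would need richer patterns.
report = "Inspected fan blade assembly; defect: hairline crack; action: replaced unit"

pattern = re.compile(
    r"Inspected (?P<part>[\w ]+?); defect: (?P<defect>[\w ]+?); action: (?P<action>[\w ]+)"
)

match = pattern.search(report)
if match:
    record = match.groupdict()  # normalised fields ready for quotation queries
    print(record)
    # {'part': 'fan blade assembly', 'defect': 'hairline crack',
    #  'action': 'replaced unit'}
```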
Abstract:
Recently, major processor manufacturers have announced a dramatic shift in their paradigm to increase computing power over the coming years. Instead of focusing on faster clock speeds and more powerful single-core CPUs, the trend clearly goes towards multi-core systems. This will also result in a paradigm shift for the development of algorithms for computationally expensive tasks, such as data mining applications. Obviously, work on parallel algorithms is not new per se, but concentrated efforts in the many application domains are still missing. Multi-core systems, but also clusters of workstations and even large-scale distributed computing infrastructures, provide new opportunities and pose new challenges for the design of parallel and distributed algorithms. Since data mining and machine learning systems rely on high-performance computing systems, research on the corresponding algorithms must be at the forefront of parallel algorithm research in order to keep pushing data mining and machine learning applications to be more powerful and, especially for the former, interactive. To bring together researchers and practitioners working in this exciting field, a workshop on parallel data mining was organized as part of PKDD/ECML 2006 (Berlin, Germany). The six contributions selected for the program describe various aspects of data mining and machine learning approaches featuring low to high degrees of parallelism: The first contribution addresses the classic problem of distributed association rule mining, focusing on communication efficiency to improve the state of the art. After this, a parallelization technique for speeding up decision tree construction by means of thread-level parallelism for shared-memory systems is presented. The next paper discusses the design of a parallel approach to the frequent subgraph mining problem for distributed-memory systems. This approach is based on a hierarchical communication topology to solve issues related to multi-domain computational environments. The fourth paper describes the combined use and customization of software packages to facilitate top-down parallelism in the tuning of Support Vector Machines (SVMs), and the next contribution presents an interesting idea concerning the parallel training of Conditional Random Fields (CRFs) and motivates their use in labeling sequential data. The last contribution focuses on very efficient feature selection: it describes a parallel algorithm for feature selection from random subsets. Selecting the papers included in this volume would not have been possible without the help of an international Program Committee that provided detailed reviews for each paper. We would also like to thank Matthew Otey, who helped with publicity for the workshop.