775 resultados para mining data streams
Resumo:
With the overwhelming increase in the amount of texts on the web, it is almost impossible for people to keep abreast of up-to-date information. Text mining is a process by which interesting information is derived from text through the discovery of patterns and trends. Text mining algorithms are used to guarantee the quality of extracted knowledge. However, the extracted patterns using text or data mining algorithms or methods leads to noisy patterns and inconsistency. Thus, different challenges arise, such as the question of how to understand these patterns, whether the model that has been used is suitable, and if all the patterns that have been extracted are relevant. Furthermore, the research raises the question of how to give a correct weight to the extracted knowledge. To address these issues, this paper presents a text post-processing method, which uses a pattern co-occurrence matrix to find the relation between extracted patterns in order to reduce noisy patterns. The main objective of this paper is not only reducing the number of closed sequential patterns, but also improving the performance of pattern mining as well. The experimental results on Reuters Corpus Volume 1 data collection and TREC filtering topics show that the proposed method is promising.
Resumo:
Big data is big news in almost every sector including crisis communication. However, not everyone has access to big data and even if we have access to big data, we often do not have necessary tools to analyze and cross reference such a large data set. Therefore this paper looks at patterns in small data sets that we have ability to collect with our current tools to understand if we can find actionable information from what we already have. We have analyzed 164390 tweets collected during 2011 earthquake to find out what type of location specific information people mention in their tweet and when do they talk about that. Based on our analysis we find that even a small data set that has far less data than a big data set can be useful to find priority disaster specific areas quickly.
Resumo:
Currently, recommender systems (RS) have been widely applied in many commercial e-commerce sites to help users deal with the information overload problem. Recommender systems provide personalized recommendations to users and thus help them in making good decisions about which product to buy from the vast number of product choices available to them. Many of the current recommender systems are developed for simple and frequently purchased products like books and videos, by using collaborative-filtering and content-based recommender system approaches. These approaches are not suitable for recommending luxurious and infrequently purchased products as they rely on a large amount of ratings data that is not usually available for such products. This research aims to explore novel approaches for recommending infrequently purchased products by exploiting user generated content such as user reviews and product click streams data. From reviews on products given by the previous users, association rules between product attributes are extracted using an association rule mining technique. Furthermore, from product click streams data, user profiles are generated using the proposed user profiling approach. Two recommendation approaches are proposed based on the knowledge extracted from these resources. The first approach is developed by formulating a new query from the initial query given by the target user, by expanding the query with the suitable association rules. In the second approach, a collaborative-filtering recommender system and search-based approaches are integrated within a hybrid system. In this hybrid system, user profiles are used to find the target user’s neighbour and the subsequent products viewed by them are then used to search for other relevant products. Experiments have been conducted on a real world dataset collected from one of the online car sale companies in Australia to evaluate the effectiveness of the proposed recommendation approaches. The experiment results show that user profiles generated from user click stream data and association rules generated from user reviews can improve recommendation accuracy. In addition, the experiment results also prove that the proposed query expansion and the hybrid collaborative filtering and search-based approaches perform better than the baseline approaches. Integrating the collaborative-filtering and search-based approaches has been challenging as this strategy has not been widely explored so far especially for recommending infrequently purchased products. Therefore, this research will provide a theoretical contribution to the recommender system field as a new technique of combining collaborative-filtering and search-based approaches will be developed. This research also contributes to a development of a new query expansion technique for infrequently purchased products recommendation. This research will also provide a practical contribution to the development of a prototype system for recommending cars.
Resumo:
Understanding network traffic behaviour is crucial for managing and securing computer networks. One important technique is to mine frequent patterns or association rules from analysed traffic data. On the one hand, association rule mining usually generates a huge number of patterns and rules, many of them meaningless or user-unwanted; on the other hand, association rule mining can miss some necessary knowledge if it does not consider the hierarchy relationships in the network traffic data. Aiming to address such issues, this paper proposes a hybrid association rule mining method for characterizing network traffic behaviour. Rather than frequent patterns, the proposed method generates non-similar closed frequent patterns from network traffic data, which can significantly reduce the number of patterns. This method also proposes to derive new attributes from the original data to discover novel knowledge according to hierarchy relationships in network traffic data and user interests. Experiments performed on real network traffic data show that the proposed method is promising and can be used in real applications. Copyright2013 John Wiley & Sons, Ltd.
Resumo:
The Council of Australian Governments (COAG) in 2003 gave in-principle approval to a best-practice report recommending a holistic approach to managing natural disasters in Australia incorporating a move from a traditional response-centric approach to a greater focus on mitigation, recovery and resilience with community well-being at the core. Since that time, there have been a range of complementary developments that have supported the COAG recommended approach. Developments have been administrative, legislative and technological, both, in reaction to the COAG initiative and resulting from regular natural disasters. This paper reviews the characteristics of the spatial data that is becoming increasingly available at Federal, state and regional jurisdictions with respect to their being fit for the purpose for disaster planning and mitigation and strengthening community resilience. In particular, Queensland foundation spatial data, which is increasingly accessible by the public under the provisions of the Right to Information Act 2009, Information Privacy Act 2009, and recent open data reform initiatives are evaluated. The Fitzroy River catchment and floodplain is used as a case study for the review undertaken. The catchment covers an area of 142,545 km2, the largest river catchment flowing to the eastern coast of Australia. The Fitzroy River basin experienced extensive flooding during the 2010–2011 Queensland floods. The basin is an area of important economic, environmental and heritage values and contains significant infrastructure critical for the mining and agricultural sectors, the two most important economic sectors for Queensland State. Consequently, the spatial datasets for this area play a critical role in disaster management and for protecting critical infrastructure essential for economic and community well-being. The foundation spatial datasets are assessed for disaster planning and mitigation purposes using data quality indicators such as resolution, accuracy, integrity, validity and audit trail.
Resumo:
Smart Card data from Automated Fare Collection system has been considered as a promising source of information for transit planning. However, literature has been limited to mining travel patterns from transit users and suggesting the potential of using this information. This paper proposes a method for mining spatial regular origins-destinations and temporal habitual travelling time from transit users. These travel regularity are discussed as being useful for transit planning. After reconstructing the travel itineraries, three levels of Density-Based Spatial Clustering of Application with Noise (DBSCAN) have been utilised to retrieve travel regularity of each of each frequent transit users. Analyses of passenger classifications and personal travel time variability estimation are performed as the examples of using travel regularity in transit planning. The methodology introduced in this paper is of interest for transit authorities in planning and managements
Resumo:
Bioacoustic data can provide an important base for environmental monitoring. To explore a large amount of field recordings collected, an automated similarity search algorithm is presented in this paper. A region of an audio defined by frequency and time bounds is provided by a user; the content of the region is used to construct a query. In the retrieving process, our algorithm will automatically scan through recordings to search for similar regions. In detail, we present a feature extraction approach based on the visual content of vocalisations – in this case ridges, and develop a generic regional representation of vocalisations for indexing. Our feature extraction method works best for bird vocalisations showing ridge characteristics. The regional representation method allows the content of an arbitrary region of a continuous recording to be described in a compressed format.
Resumo:
The mining equipment technology services sector is driven by a reactive and user-centered design approach, with a technological focus on incremental new product development. As Australia moves out of its sustained mining boom, companies need to rethink their strategic position, to become agile to stay relevant in an enigmatic market. This paper reports on the first five months on an embedded case study within an Australian, family-owned mining manufacturer. The first author is currently engaged in a longitudinal design led innovation project, as a catalyst to guide the company’s journey to design integration. The results find that design led innovation could act as a channel for highlighting and exploring company disconnections with the marketplace and offer a customer-centric catalyst for internal change. Data collected for this study is from 12 analysed semistructured interviews, a focus group and a reflective journal, over a five-month period. This paper explores limitations to design integration, and highlights opportunities to explore and leverage entrepreneurial characteristics to stay agile, broaden innovation and future-proof through the next commodity cycle in the mining industry.
Resumo:
A people-to-people matching system (or a match-making system) refers to a system in which users join with the objective of meeting other users with the common need. Some real-world examples of these systems are employer-employee (in job search networks), mentor-student (in university social networks), consume-to-consumer (in marketplaces) and male-female (in an online dating network). The network underlying in these systems consists of two groups of users, and the relationships between users need to be captured for developing an efficient match-making system. Most of the existing studies utilize information either about each of the users in isolation or their interaction separately, and develop recommender systems using the one form of information only. It is imperative to understand the linkages among the users in the network and use them in developing a match-making system. This study utilizes several social network analysis methods such as graph theory, small world phenomenon, centrality analysis, density analysis to gain insight into the entities and their relationships present in this network. This paper also proposes a new type of graph called “attributed bipartite graph”. By using these analyses and the proposed type of graph, an efficient hybrid recommender system is developed which generates recommendation for new users as well as shows improvement in accuracy over the baseline methods.
Resumo:
Trees are capable of portraying the semi-structured data which is common in web domain. Finding similarities between trees is mandatory for several applications that deal with semi-structured data. Existing similarity methods examine a pair of trees by comparing through nodes and paths of two trees, and find the similarity between them. However, these methods provide unfavorable results for unordered tree data and result in yielding NP-hard or MAX-SNP hard complexity. In this paper, we present a novel method that encodes a tree with an optimal traversing approach first, and then, utilizes it to model the tree with its equivalent matrix representation for finding similarity between unordered trees efficiently. Empirical analysis shows that the proposed method is able to achieve high accuracy even on the large data sets.
Resumo:
Transit passenger market segmentation enables transit operators to target different classes of transit users to provide customized information and services. The Smart Card (SC) data, from Automated Fare Collection system, facilitates the understanding of multiday travel regularity of transit passengers, and can be used to segment them into identifiable classes of similar behaviors and needs. However, the use of SC data for market segmentation has attracted very limited attention in the literature. This paper proposes a novel methodology for mining spatial and temporal travel regularity from each individual passenger’s historical SC transactions and segments them into four segments of transit users. After reconstructing the travel itineraries from historical SC transactions, the paper adopts the Density-Based Spatial Clustering of Application with Noise (DBSCAN) algorithm to mine travel regularity of each SC user. The travel regularity is then used to segment SC users by an a priori market segmentation approach. The methodology proposed in this paper assists transit operators to understand their passengers and provide them oriented information and services.
Resumo:
Business process analysis and process mining, particularly within the health care domain, remain under-utilised. Applied research that employs such techniques to routinely collected, health care data enables stakeholders to empirically investigate care as it is delivered by different health providers. However, cross-organisational mining and the comparative analysis of processes present a set of unique challenges in terms of ensuring population and activity comparability, visualising the mined models and interpreting the results. Without addressing these issues, health providers will find it difficult to use process mining insights, and the potential benefits of evidence-based process improvement within health will remain unrealised. In this paper, we present a brief introduction on the nature of health care processes; a review of the process mining in health literature; and a case study conducted to explore and learn how health care data, and cross-organisational comparisons with process mining techniques may be approached. The case study applies process mining techniques to administrative and clinical data for patients who present with chest pain symptoms at one of four public hospitals in South Australia. We demonstrate an approach that provides detailed insights into clinical (quality of patient health) and fiscal (hospital budget) pressures in health care practice. We conclude by discussing the key lessons learned from our experience in conducting business process analysis and process mining based on the data from four different hospitals.
Resumo:
Big Data presents many challenges related to volume, whether one is interested in studying past datasets or, even more problematically, attempting to work with live streams of data. The most obvious challenge, in a ‘noisy’ environment such as contemporary social media, is to collect the pertinent information; be that information for a specific study, tweets which can inform emergency services or other responders to an ongoing crisis, or give an advantage to those involved in prediction markets. Often, such a process is iterative, with keywords and hashtags changing with the passage of time, and both collection and analytic methodologies need to be continually adapted to respond to this changing information. While many of the data sets collected and analyzed are preformed, that is they are built around a particular keyword, hashtag, or set of authors, they still contain a large volume of information, much of which is unnecessary for the current purpose and/or potentially useful for future projects. Accordingly, this panel considers methods for separating and combining data to optimize big data research and report findings to stakeholders. The first paper considers possible coding mechanisms for incoming tweets during a crisis, taking a large stream of incoming tweets and selecting which of those need to be immediately placed in front of responders, for manual filtering and possible action. The paper suggests two solutions for this, content analysis and user profiling. In the former case, aspects of the tweet are assigned a score to assess its likely relationship to the topic at hand, and the urgency of the information, whilst the latter attempts to identify those users who are either serving as amplifiers of information or are known as an authoritative source. Through these techniques, the information contained in a large dataset could be filtered down to match the expected capacity of emergency responders, and knowledge as to the core keywords or hashtags relating to the current event is constantly refined for future data collection. The second paper is also concerned with identifying significant tweets, but in this case tweets relevant to particular prediction market; tennis betting. As increasing numbers of professional sports men and women create Twitter accounts to communicate with their fans, information is being shared regarding injuries, form and emotions which have the potential to impact on future results. As has already been demonstrated with leading US sports, such information is extremely valuable. Tennis, as with American Football (NFL) and Baseball (MLB) has paid subscription services which manually filter incoming news sources, including tweets, for information valuable to gamblers, gambling operators, and fantasy sports players. However, whilst such services are still niche operations, much of the value of information is lost by the time it reaches one of these services. The paper thus considers how information could be filtered from twitter user lists and hash tag or keyword monitoring, assessing the value of the source, information, and the prediction markets to which it may relate. The third paper examines methods for collecting Twitter data and following changes in an ongoing, dynamic social movement, such as the Occupy Wall Street movement. It involves the development of technical infrastructure to collect and make the tweets available for exploration and analysis. A strategy to respond to changes in the social movement is also required or the resulting tweets will only reflect the discussions and strategies the movement used at the time the keyword list is created — in a way, keyword creation is part strategy and part art. In this paper we describe strategies for the creation of a social media archive, specifically tweets related to the Occupy Wall Street movement, and methods for continuing to adapt data collection strategies as the movement’s presence in Twitter changes over time. We also discuss the opportunities and methods to extract data smaller slices of data from an archive of social media data to support a multitude of research projects in multiple fields of study. The common theme amongst these papers is that of constructing a data set, filtering it for a specific purpose, and then using the resulting information to aid in future data collection. The intention is that through the papers presented, and subsequent discussion, the panel will inform the wider research community not only on the objectives and limitations of data collection, live analytics, and filtering, but also on current and in-development methodologies that could be adopted by those working with such datasets, and how such approaches could be customized depending on the project stakeholders.
Resumo:
Aims: To compare different methods for identifying alcohol involvement in injury-related emergency department presentation in Queensland youth, and to explore the alcohol terminology used in triage text. Methods: Emergency Department Information System data were provided for patients aged 12-24 years with an injury-related diagnosis code for a 5 year period 2006-2010 presenting to a Queensland emergency department (N=348895). Three approaches were used to estimate alcohol involvement: 1) analysis of coded data, 2) mining of triage text, and 3) estimation using an adaptation of alcohol attributable fractions (AAF). Cases were identified as ‘alcohol-involved’ by code and text, as well as AAF weighted. Results: Around 6.4% of these injury presentations overall had some documentation of alcohol involvement, with higher proportions of alcohol involvement documented for 18-24 year olds, females, indigenous youth, where presentations occurred on a Saturday or Sunday, and where presentations occurred between midnight and 5am. The most common alcohol terms identified for all subgroups were generic alcohol terms (eg. ETOH or alcohol) with almost half of the cases where alcohol involvement was documented having a generic alcohol term recorded in the triage text. Conclusions: Emergency department data is a useful source of information for identification of high risk sub-groups to target intervention opportunities, though it is not a reliable source of data for incidence or trend estimation in its current unstandardised form. Improving the accuracy and consistency of identification, documenting and coding of alcohol-involvement at the point of data capture in the emergency department is the most desirable long term approach to produce a more solid evidence base to support policy and practice in this field.
Resumo:
Modern health information systems can generate several exabytes of patient data, the so called "Health Big Data", per year. Many health managers and experts believe that with the data, it is possible to easily discover useful knowledge to improve health policies, increase patient safety and eliminate redundancies and unnecessary costs. The objective of this paper is to discuss the characteristics of Health Big Data as well as the challenges and solutions for health Big Data Analytics (BDA) – the process of extracting knowledge from sets of Health Big Data – and to design and evaluate a pipelined framework for use as a guideline/reference in health BDA.