1000 results for Incremental mining


Relevance:

20.00%

Publisher:

Abstract:

Streams of short text, such as news titles, enable us to effectively and efficiently learn about real-world events that occur anywhere and at any time. Short text messages, which are accompanied by timestamps and generally describe events briefly using only a few words, differ from longer text documents such as web pages, news stories, blogs, technical papers and books. For example, few words repeat within the same news title, so term frequency (TF) is not as important in a short text corpus as in a longer text corpus. Therefore, analysis of short text faces new challenges. Also, detecting and tracking events through short text analysis needs to reliably identify events from constant topic clusters; however, existing methods, such as Latent Dirichlet Allocation (LDA), generate different topic results for a corpus at different executions. In this paper, we provide a Finding Topic Clusters using Co-occurring Terms (FTCCT) algorithm to automatically generate topics from a short text corpus, and develop an Event Evolution Mining (EEM) algorithm to discover hot events and their evolutions (i.e., the popularity degrees of events changing over time). In FTCCT, a term (i.e., a single word or a multi-word phrase) belongs to only one topic in a corpus. Experiments on news titles of 157 countries within 4 months (from July to October 2013) demonstrate that our FTCCT-based method (combining FTCCT and EEM) achieves far higher quality of the event's content and description words than the LDA-based method (combining LDA and EEM) for analysis of streams of short text. Our method also visualizes the evolutions of the hot events. The discovered world-wide event evolutions reveal some interesting correlations among world-wide events; for example, successive extreme weather phenomena occurred in different locations: typhoons in Hong Kong and the Philippines followed a hurricane and storm flooding in Mexico in September 2013. © 2014 Springer Science+Business Media New York.

Relevance:

20.00%

Publisher:

Abstract:

Due to the serious information overload problem on the Internet, recommender systems have emerged as an important tool for recommending more useful information to users by providing personalized services for individual users. However, in the "big data" era, recommender systems face significant challenges, such as how to process massive data efficiently and accurately. In this paper we propose a scalable incremental algorithm based on singular value decomposition (SVD), called the Incremental ApproSVD, which combines the Incremental SVD algorithm with the Approximating the Singular Value Decomposition (ApproSVD) algorithm. Furthermore, a strict error analysis demonstrates the effectiveness of our Incremental ApproSVD algorithm. We then present an empirical study comparing the prediction accuracy and running time of our Incremental ApproSVD algorithm and the Incremental SVD algorithm on the MovieLens dataset and the Flixster dataset. The experimental results demonstrate that our proposed method outperforms its counterparts.

Relevance:

20.00%

Publisher:

Abstract:

An Android application uses a permission system to regulate access to system resources and users' privacy-relevant information. Existing works have demonstrated several techniques to study the required permissions declared by developers, but little attention has been paid to used permissions. Moreover, no specific permission combination has been identified as effective for malware detection. To fill these gaps, we propose a novel pattern mining algorithm to identify a set of contrast permission patterns that aim to detect the difference between clean and malicious applications. We collected a benchmark malware dataset and a dataset of 1227 clean applications to evaluate the performance of the proposed algorithm. Valuable findings are obtained by analyzing the returned contrast permission patterns. © 2013 Elsevier B.V. All rights reserved.

Relevance:

20.00%

Publisher:

Abstract:

Autism spectrum disorder (ASD) is increasingly being recognized as a major public health issue which affects approximately 0.5-0.6% of the population. Promoting general awareness of the disorder, increasing engagement with affected individuals and their carers, and understanding how successfully current clinical recommendations penetrate the target communities are crucial in driving research as well as policy. The aim of the present work is to investigate whether Twitter, as a highly popular platform for information exchange, can be used as a data-mining source which could help address the aforementioned challenges. Specifically, using a large data set of harvested tweets, we present a series of experiments which examine a range of linguistic and semantic aspects of messages posted by individuals interested in ASD. Our findings, the first of their nature in the published scientific literature, strongly motivate additional research on this topic and present a methodological basis for further work.

Relevance:

20.00%

Publisher:

Abstract:

In big data analysis, frequent itemsets mining plays a key role in mining associations, correlations and causality. Since some traditional frequent itemsets mining algorithms are unable to handle massive small files datasets effectively, owing to problems such as high memory cost, high I/O overhead, and low computing performance, we propose a novel parallel frequent itemsets mining algorithm based on the FP-Growth algorithm and discuss its applications in this paper. First, we introduce a small files processing strategy for massive small files datasets to compensate for the low read-write speed and low processing efficiency of Hadoop. Moreover, we use MapReduce to redesign the FP-Growth algorithm for parallel computing, thereby improving the overall performance of frequent itemsets mining. Finally, we apply the proposed algorithm to the association analysis of data from the national college entrance examination and admission of China. The experimental results show that the proposed algorithm is feasible, achieves a good speedup and a higher mining efficiency, and can meet the practical requirements of frequent itemsets mining for massive small files datasets. © 2014 ISSN 2185-2766.
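
As a rough illustration of the map/reduce style mentioned in the abstract above, the sketch below counts item frequencies and keeps the items that reach a minimum support, which is the usual first pass before an FP-tree is built. It runs in a single process and is not the paper's algorithm; the function names `map_phase`, `reduce_phase` and `frequent_items` and the sample transactions are assumptions made for illustration.

```python
from collections import Counter
from itertools import chain


def map_phase(transaction):
    # Emit one (item, 1) pair per distinct item in a transaction.
    return [(item, 1) for item in set(transaction)]


def reduce_phase(pairs):
    # Sum the emitted counts per item.
    counts = Counter()
    for item, n in pairs:
        counts[item] += n
    return counts


def frequent_items(transactions, min_support):
    # Hypothetical driver: map every transaction, reduce, then filter
    # by an absolute minimum support count.
    pairs = chain.from_iterable(map_phase(t) for t in transactions)
    counts = reduce_phase(pairs)
    return {item: c for item, c in counts.items() if c >= min_support}


# Illustrative data only.
transactions = [["a", "b", "c"], ["a", "c"], ["a", "d"], ["b", "c"]]
frequent = frequent_items(transactions, min_support=2)
```

In a real Hadoop job the map and reduce functions would run on separate workers over file splits; here they are plain functions so the data flow is easy to follow.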

Relevance:

20.00%

Publisher:

Abstract:

In recent years, evaluating the influence of nodes and finding the top-k influential nodes in social networks has drawn wide attention and has become a hot research issue. Considering the characteristics of social networks, we present a novel mechanism to mine the top-k influential nodes in mobile social networks. The proposed mechanism is based on behavior analysis of SMS/MMS (short message service / multimedia messaging service) communication between mobile users. We introduce complex network theory to build a social relation graph, which is used to reveal the relationship between people's social contacts and message sending. Moreover, an intimacy degree is introduced to characterize the social frequency among nodes. An election mechanism is employed to find the most influential node, and then a heap sorting algorithm is used to sort the voting results to find the k most influential nodes. The experimental results show that the mechanism can find the top-k most influential nodes efficiently and effectively. © 2013 IEEE.
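
The vote-then-heap-select idea in the abstract above can be sketched as follows. The abstract does not define the voting rule or the intimacy degree, so this sketch assumes, purely for illustration, that each sender casts votes for a receiver in proportion to the number of messages sent; `top_k_influential` and the sample data are hypothetical.

```python
import heapq
from collections import defaultdict


def top_k_influential(messages, k):
    # messages: iterable of (sender, receiver, message_count) tuples.
    # Assumed voting rule: message_count stands in for intimacy, and
    # every message sent is one vote for the receiver.
    votes = defaultdict(float)
    for sender, receiver, count in messages:
        votes[receiver] += count
    # heapq.nlargest selects the k highest-voted nodes with a heap,
    # avoiding a full sort of all nodes.
    best = heapq.nlargest(k, votes.items(), key=lambda kv: kv[1])
    return [node for node, _ in best]


# Illustrative data only.
messages = [("ann", "bob", 5), ("cat", "bob", 2), ("bob", "ann", 4), ("dan", "cat", 1)]
leaders = top_k_influential(messages, k=2)
```

Heap-based selection costs roughly O(n log k) for n nodes, which is why it suits top-k queries where k is much smaller than n.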

Relevance:

20.00%

Publisher:

Abstract:

The thesis has studied a number of critical problems in data mining for customer behavior analysis and has proposed novel techniques for better modeling of the customers’ decision making process, more efficient analysis of their travel behavior, and more effective identification of their emerging preference.

Relevance:

20.00%

Publisher:

Abstract:

This paper examines the relationship between the output levels in the mining sector and various non-mining sectors in an attempt to understand the role of the mining sector in Australia. The unobserved components time series model is used to estimate the effects of the output gap and the growth regime in the mining sector on the output level of each of several non-mining sectors. Overall, the estimates obtained do not suggest an overwhelmingly positive effect running from the mining sector to other production and services sectors, implying that the trickle-down effect of the mining boom may be a myth.

Relevance:

20.00%

Publisher:

Abstract:

Although agricultural productivity is critical for economic development, very little is known about the causes of the large dispersion in agricultural productivity across the world. Microeconomic studies increasingly stress the lack of land rights in many poor countries as an important source of low productivity. This paper examines the role played by land titles in explaining differences in agricultural productivity for 93 countries. Using the per capita accumulated value of gold and silver production in the 16th and 17th centuries as instruments for land rights, it is shown that enforcement of land titles is a significant source of agricultural productivity inequality across the world.

Relevance:

20.00%

Publisher:

Abstract:

An Association Rule (AR) is a common knowledge model in data mining that describes an implicative co-occurring relationship between two disjoint sets of binary-valued transaction database attributes (items), expressed in the form of an "antecedent ⇒ consequent" rule. A variant of the AR is the Weighted Association Rule (WAR). With regard to a marketing context, this paper introduces a new knowledge model in data mining: the ALlocating Pattern (ALP). An ALP is a special form of WAR, where each rule item is associated with a weighting score between 0 and 1, and the sum of all rule item scores is 1. It not only indicates the implicative co-occurring relationship between two (disjoint) sets of items in a weighted setting, but also conveys the "allocating" relationship among rule items. ALPs can be shown to be applicable in marketing and possibly a surprising variety of other areas. We further propose an Apriori-based algorithm to extract hidden and interesting ALPs from a "one-sum" weighted transaction database. The experimental results show the effectiveness of the proposed algorithm. © 2008 IEEE.
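
To make the "one-sum" constraint in the abstract above concrete, the sketch below rescales the raw amounts in a single transaction so that every item weight lies between 0 and 1 and the weights sum to 1. It illustrates the constraint only, not the paper's mining algorithm; `one_sum_weights`, the use of spending amounts as raw scores, and the sample basket are assumptions.

```python
def one_sum_weights(amounts):
    # amounts: mapping of item -> non-negative raw score (assumed here
    # to be money spent on the item within one transaction).
    total = sum(amounts.values())
    if total <= 0:
        raise ValueError("a transaction needs a positive total to be normalized")
    # Divide by the total so the resulting weights sum to 1.
    return {item: value / total for item, value in amounts.items()}


# Illustrative data only.
basket = {"bread": 2.0, "milk": 6.0}
weights = one_sum_weights(basket)
```

For this basket the weights say that a quarter of the spend is allocated to bread and three quarters to milk, which is the kind of "allocating" relationship the abstract describes.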

Relevance:

20.00%

Publisher:

Abstract:

Hotel managers continue to look for ways to understand traveler preferences, with the aim of improving their strategic planning, marketing, and product development. Traveler preference is unpredictable; for example, hotel guests used to prefer having a telephone in the room, but now favor a fast Internet connection. Changes in preference influence the performance of hotel businesses, thus creating the need to identify and address the demands of their guests. Most existing studies focus on current demand attributes and not on emerging ones. Thus, hotel managers may find it difficult to make appropriate decisions in response to changes in travelers' concerns. To address these challenges, this paper adopts the Emerging Pattern Mining technique to identify emergent hotel features of interest to international travelers. Data are derived from 118,000 records of online reviews. The methods and findings can help hotel managers gain insights into travelers' interests, enabling them to better understand the rapid changes in tourist preferences.
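
Emerging patterns are commonly scored by their growth rate, the ratio of an itemset's support in a newer dataset to its support in an older one. The sketch below shows that measure on toy data echoing the telephone and Internet example in the abstract above; the function names and the review sets are assumptions, and the paper's actual features and thresholds are not given here.

```python
def support(itemset, transactions):
    # Fraction of transactions that contain every item in itemset.
    itemset = set(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def growth_rate(itemset, older, newer):
    # Ratio of support in the newer data to support in the older data.
    s_old = support(itemset, older)
    s_new = support(itemset, newer)
    if s_old == 0:
        # Absent in both gives 0; absent only in the old data gives infinity.
        return 0.0 if s_new == 0 else float("inf")
    return s_new / s_old


# Illustrative data only: features mentioned in each review.
older_reviews = [{"telephone"}, {"telephone", "wifi"}, {"telephone"}, {"pool"}]
newer_reviews = [{"wifi"}, {"wifi", "pool"}, {"wifi"}, {"telephone"}]
wifi_growth = growth_rate({"wifi"}, older_reviews, newer_reviews)
```

An itemset whose growth rate exceeds a chosen threshold would be reported as emerging; a rate below 1, as for "telephone" in this toy data, indicates declining interest.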

Relevance:

20.00%

Publisher:

Abstract:

Cancer remains a major challenge in modern medicine. The increasing prevalence of cancer, particularly in developing countries, demands a better understanding of the effectiveness and adverse consequences of different cancer treatment regimes in real patient populations. Current understanding of cancer treatment toxicities is often derived from either "clean" patient cohorts or coarse population statistics. It is difficult to get an up-to-date, local assessment of treatment toxicities for specific cancer centres. In this paper, we apply an Apriori-based method for discovering toxicity progression patterns in the form of temporal association rules. Our experiments show the effectiveness of the proposed method in discovering major toxicity patterns in comparison with pairwise association analysis. Our method is applicable to most cancer centres with even rudimentary electronic medical records and has the potential to provide real-time surveillance and quality assurance in cancer care.

Relevance:

20.00%

Publisher:

Abstract:

Mobile Health (mHealth) is now emerging alongside the Internet of Things (IoT), Cloud and big data, along with the prevalence of smart wearable devices and sensors. Smart environments such as smart homes, cars, highways, cities, factories and grids are also emerging. Presently, it is difficult to quickly forecast or prevent urgent health situations in real time, as health data are analyzed offline by a physician. Sensors are expected to be overloaded by demands to provide health data from IoT networks and smart environments. This paper proposes to resolve these problems by introducing an inference system so that life-threatening situations can be prevented in advance based on short-term and long-term health status prediction. This prediction is inferred from personal health information that is built from big data in the Cloud. The inference system can also resolve the problem of data overload in sensor nodes by reducing data volume and frequency, thereby reducing the workload of the sensor nodes. This paper presents a novel idea of tracking and predicting personal health status, as well as intelligent inference functionality in sensor nodes to interface with IoT networks.

Relevance:

20.00%

Publisher:

Abstract:

The low accuracy rates of text-shape dividers for digital ink diagrams are hindering their use in real-world applications. While recognition of handwriting is well advanced and many recognition approaches have been proposed for hand-drawn sketches, less attention has been paid to the division of text and drawing ink. Feature-based recognition is a common approach for text-shape division. However, the choice of features and algorithms is critical to the success of the recognition. We propose the use of data mining techniques to build more accurate text-shape dividers. A comparative study is used to systematically identify the algorithms best suited to the specific problem. We have generated dividers using data mining with diagrams from three domains and a comprehensive ink feature library. An extensive evaluation on diagrams from six different domains has shown that our resulting dividers, using LADTree and LogitBoost, are significantly more accurate than three existing dividers.