98 resultados para data Mining


Relevância:

70.00% 70.00%

Publicador:

Resumo:

The thesis has researched a set of critical problems in data mining and has proposed four advanced pattern mining algorithm to discover the most interesting and useful data patterns highly relevant to the user’s application targets from the data is represented in complex structures.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Joint analysis of multiple data sources is becoming increasingly popular in transfer learning, multi-task learning and cross-domain data mining. One promising approach to model the data jointly is through learning the shared and individual factor subspaces. However, performance of this approach depends on the subspace dimensionalities and the level of sharing needs to be specified a priori. To this end, we propose a nonparametric joint factor analysis framework for modeling multiple related data sources. Our model utilizes the hierarchical beta process as a nonparametric prior to automatically infer the number of shared and individual factors. For posterior inference, we provide a Gibbs sampling scheme using auxiliary variables. The effectiveness of the proposed framework is validated through its application on two real world problems - transfer learning in text and image retrieval.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Discovering frequent patterns plays an essential role in many data mining applications. The aim of frequent patterns is to obtain the information about the most common patterns that appeared together. However, designing an efficient model to mine these patterns is still demanding due to the capacity of current database size. Therefore, we propose an Efficient Frequent Pattern Mining Model (EFP-M2) to mine the frequent patterns in timely manner. The result shows that the algorithm in EFP-M2l is outperformed at least at 2 orders of magnitudes against the benchmarked FP-Growth.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Even with the presence of modern obstetric care, stillbirth rate seems to stay stagnant or has even risen slightly in countries such as England and has become a significant public health concern [1]. In the light of current medical research, maternal risk factors such as diabetes and hypertensive disease were identified as possible risk factors and are taken into consideration in antenatal care. However, medical practitioners and researchers suspect possible relationships between trends in maternal demographics, antenatal care and pregnancy information of current stillbirth in consideration [2]. Although medical data and knowledge is available appropriate computing techniques to analyze the data may lead to identification of high risk groups. In this paper we use an unsupervised clustering technique called Growing Self organizing Map (GSOM) to analyse the stillbirth data and present patterns which can be important to medical researchers.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

We examine a recent proposal for data-privatization by testing it against well-known attacks, we show that all of these attacks successfully retrieve a relatively large (and unacceptable) portion of the original data. We then indicate how the data-privatization method examined can be modified to assist it to withstand these attacks and compare the performance of the two approaches. We also show that the new method has better privacy and lower information loss than the former method.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Human associated delay-tolerant networks (HDTNs) are new networks for DTNs, where mobile devices are associated with humans and demonstrate social related communication characteristics. As most of recent works use real social trace files to study the date forwarding in HDTNs, the privacy protection becomes a serious issue. Traditional privacy protections need to keep the attributes semantics, such as data mining and information retrieval. However, in HDTNs, it is not necessary to keep these meaningful semantics. In this paper, instead, we propose to anonymize the original data by coding to preserve individual's privacy and apply Privacy Protected Data Forwarding (PPDF) model to select the top N nodes to perform the multicast. We use both MIT Reality and Infocom 06 datasets, which are human associated mobile network trace file, to simulate our model. The results of our simulations show that this method can achieve a high data forwarding performance while protect the nodes' privacy as well.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Differential privacy is a strong definition for protecting individual privacy in data releasing and mining. However, it is a rigid definition introducing a large amount of noise to the original dataset, which significantly decreases the quality of data mining results. Recently, how to design a suitable data releasing algorithm for data mining purpose is a hot research area. In this paper, we propose a differential private data releasing algorithm for decision tree construction. The proposed algorithm provides a non-interactive data releasing method through which miner can obtain the complete dataset for data mining purpose. With a given privacy budget, the proposed algorithm generalizes the original dataset, and then specializes it in a differential privacy constrain to construct decision trees. As the designed novel scheme selection operation can fully utilize the allocated privacy budget, the data set released by the proposed algorithm can yield better decision tree models than other method. Experimental results demonstrate that the proposed algorithm outperforms existing methods for private decision tree construction.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The exponential increase in data, computing power and the availability of readily accessible analytical software has allowed organisations around the world to leverage the benefits of integrating multiple heterogeneous data files for enterprise-level planning and decision making. Benefits from effective data integration to the health and medical research community include more trustworthy research, higher service quality, improved personnel efficiency, reduction of redundant tasks, facilitation of auditing and more timely, relevant and specific information. The costs of poor quality processes elevate the risk of erroneous outcomes, an erosion of confidence in the data and the organisations using these data. To date there are no documented set of standards for best practice integration of heterogeneous data files for research purposes. Therefore, the aim of this paper is to describe a set of clear protocol for data file integration (Data Integration Protocol In Ten-steps; DIPIT) translational to any field of research.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Supervisory Control and Data Acquisition (SCADA) systems control and monitor industrial and critical infrastructure functions, such as electricity, gas, water, waste, railway, and traffic. Recent attacks on SCADA systems highlight the need for stronger SCADA security. Thus, sharing SCADA traffic data has become a vital requirement in SCADA systems to analyze security risks and develop appropriate security solutions. However, inappropriate sharing and usage of SCADA data could threaten the privacy of companies and prevent sharing of data. In this paper, we present a privacy preserving strategy-based permutation technique called PPFSCADA framework, in which data privacy, statistical properties and data mining utilities can be controlled at the same time. In particular, our proposed approach involves: (i) vertically partitioning the original data set to improve the performance of perturbation; (ii) developing a framework to deal with various types of network traffic data including numerical, categorical and hierarchical attributes; (iii) grouping the portioned sets into a number of clusters based on the proposed framework; and (iv) the perturbation process is accomplished by the alteration of the original attribute value by a new value (clusters centroid). The effectiveness of the proposed PPFSCADA framework is shown through several experiments on simulated SCADA, intrusion detection and network traffic data sets. Through experimental analysis, we show that PPFSCADA effectively deals with multivariate traffic attributes, producing compatible results as the original data, and also substantially improving the performance of the five supervised approaches and provides high level of privacy protection. © 2014 Published by Elsevier B.V. All rights reserved.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

An Android application uses a permission system to regulate the access to system resources and users' privacy-relevant information. Existing works have demonstrated several techniques to study the required permissions declared by the developers, but little attention has been paid towards used permissions. Besides, no specific permission combination is identified to be effective for malware detection. To fill these gaps, we have proposed a novel pattern mining algorithm to identify a set of contrast permission patterns that aim to detect the difference between clean and malicious applications. A benchmark malware dataset and a dataset of 1227 clean applications has been collected by us to evaluate the performance of the proposed algorithm. Valuable findings are obtained by analyzing the returned contrast permission patterns. © 2013 Elsevier B.V. All rights reserved.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

The thesis has studied a number of critical problems in data mining for customer behavior analysis and has proposed novel techniques for better modeling of the customers’ decision making process, more efficient analysis of their travel behavior, and more effective identification of their emerging preference.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

An Association Rule (AR) is a common knowledge model in data mining that describes an implicative cooccurring relationship between two disjoint sets of binary-valued transaction database attributes (items), expressed in the form of an "antecedent⇒ consequent" rule. A variant of the AR is the Weighted Association Rule (WAR). With regard to a marketing context, this paper introduces a new knowledge model in data mining -ALlocating Pattern (ALP). An ALP is a special form of WAR, where each rule item is associated with a weighting score between 0 and 1, and the sum of all rule item scores is 1. It can not only indicate the implicative co-occurring relationship between two (disjoint) sets of items in a weighted setting, but also inform the "allocating" relationship among rule items. ALPs can be demonstrated to be applicable in marketing and possibly a surprising variety of other areas. We further propose an Apriori based algorithm to extract hidden and interesting ALPs from a "one-sum" weighted transaction database. The experimental results show the effectiveness of the proposed algorithm. © 2008 IEEE.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Hotel managers continue to find ways to understand traveler preferences, with the aim of improving their strategic planning, marketing, and product development. Traveler preference is unpredictable for example, hotel guests used to prefer having a telephone in the room, but now favor fast Internet connection. Changes in preference influence the performance of hotel businesses, thus creating the need to identify and address the demands of their guests. Most existing studies focus on current demand attributes and not on emerging ones. Thus, hotel managers may find it difficult to make appropriate decisions in response to changes in travelers' concerns. To address these challenges, this paper adopts Emerging Pattern Mining technique to identify emergent hotel features of interest to international travelers. Data are derived from 118,000 records of online reviews. The methods and findings can help hotel managers gain insights into travelers' interests, enabling the former to gain a better understanding of the rapid changes in tourist preferences.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Privacy preserving on data mining and data release has attracted an increasing research interest over a number of decades. Differential privacy is one influential privacy notion that offers a rigorous and provable privacy guarantee for data mining and data release. Existing studies on differential privacy assume that in a data set, records are sampled independently. However, in real-world applications, records in a data set are rarely independent. The relationships among records are referred to as correlated information and the data set is defined as correlated data set. A differential privacy technique performed on a correlated data set will disclose more information than expected, and this is a serious privacy violation. Although recent research was concerned with this new privacy violation, it still calls for a solid solution for the correlated data set. Moreover, how to decrease the large amount of noise incurred via differential privacy in correlated data set is yet to be explored. To fill the gap, this paper proposes an effective correlated differential privacy solution by defining the correlated sensitivity and designing a correlated data releasing mechanism. With consideration of the correlated levels between records, the proposed correlated sensitivity can significantly decrease the noise compared with traditional global sensitivity. The correlated data releasing mechanism correlated iteration mechanism is designed based on an iterative method to answer a large number of queries. Compared with the traditional method, the proposed correlated differential privacy solution enhances the privacy guarantee for a correlated data set with less accuracy cost. Experimental results show that the proposed solution outperforms traditional differential privacy in terms of mean square error on large group of queries. This also suggests the correlated differential privacy can successfully retain the utility while preserving the privacy.

Relevância:

70.00% 70.00%

Publicador:

Resumo:

Medical outcomes are inexorably linked to patient illness and clinical interventions. Interventions change the course of disease, crucially determining outcome. Traditional outcome prediction models build a single classifier by augmenting interventions with disease information. Interventions, however, differentially affect prognosis, thus a single prediction rule may not suffice to capture variations. Interventions also evolve over time as more advanced interventions replace older ones. To this end, we propose a Bayesian nonparametric, supervised framework that models a set of intervention groups through a mixture distribution building a separate prediction rule for each group, and allows the mixture distribution to change with time. This is achieved by using a hierarchical Dirichlet process mixture model over the interventions. The outcome is then modeled as conditional on both the latent grouping and the disease information through a Bayesian logistic regression. Experiments on synthetic and medical cohorts for 30-day readmission prediction demonstrate the superiority of the proposed model over clinical and data mining baselines.