98 results for data Mining


Relevance:

80.00%

Publisher:

Abstract:

For most data stream applications, the volume of data is too large to be stored on permanent devices or to be scanned thoroughly more than once. It is hence recognized that approximate answers are usually sufficient: a good approximation obtained in a timely manner is often better than an exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams, where algorithms capable of processing data streams online do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms that meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data while guaranteeing that the support error of frequent patterns stays strictly within a user-specified threshold. Our theoretical and experimental studies show that our algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.
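The paper's own algorithm is not reproduced in this abstract, but the general idea of one-pass frequency counting with a user-specified error bound can be illustrated by a minimal sketch in the style of the well-known Lossy Counting algorithm. The restriction to single items (rather than full frequent patterns) and all names below are assumptions for illustration only.

```python
def lossy_count(stream, epsilon, min_support):
    """One-pass approximate frequency counting with error bounded by epsilon.

    Reported counts underestimate true counts by at most epsilon * N, and no
    item with true frequency >= min_support * N is missed.
    (Illustrative sketch of Lossy Counting for single items, not the
    paper's frequent-pattern algorithm.)
    """
    bucket_width = int(1 / epsilon)          # items per bucket
    counts = {}                              # item -> (count, max_error)
    current_bucket = 1
    n = 0

    for item in stream:
        n += 1
        if item in counts:
            count, err = counts[item]
            counts[item] = (count + 1, err)
        else:
            counts[item] = (1, current_bucket - 1)

        if n % bucket_width == 0:            # end of bucket: prune low counts
            counts = {it: (c, e) for it, (c, e) in counts.items()
                      if c + e > current_bucket}
            current_bucket += 1

    threshold = (min_support - epsilon) * n
    return {it: c for it, (c, e) in counts.items() if c >= threshold}
```

For example, `lossy_count(stream, epsilon=0.001, min_support=0.01)` keeps memory proportional to 1/epsilon while guaranteeing no false negatives at the 1% support level.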

Relevance:

80.00%

Publisher:

Abstract:

The problem of extracting infrequent patterns from streams and building associations between these patterns is becoming increasingly relevant today, as many events of interest, such as attacks in network data or unusual stories in news data, occur rarely. The complexity of the problem is compounded when a system is required to deal with data from multiple streams. To address these problems, we present a framework that combines time-based association mining with a pyramidal structure that allows a rolling analysis of the stream and maintains a synopsis of the data without requiring increasing memory resources. We apply the algorithms and show the usefulness of the techniques. © 2007 Crown Copyright.
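As a rough illustration of how a pyramidal (tilted-time) structure can keep a bounded synopsis of an unbounded stream, the sketch below merges older batch summaries into progressively coarser windows, so memory grows only logarithmically with stream length. Class and method names are hypothetical; the paper's actual structure and its coupling to time-based association mining are not reproduced.

```python
class PyramidalSynopsis:
    """Logarithmic tilted-time windows: level k summarises about 2**k batches.
    Each level keeps at most two summaries; overflow is merged upward,
    so memory is O(log n) in the number of batches seen.
    (Illustrative sketch only; names are hypothetical.)"""

    def __init__(self, merge_fn):
        self.levels = []          # levels[k] = list of at most 2 summaries
        self.merge_fn = merge_fn  # how two summaries are combined

    def add_batch(self, summary):
        carry, k = summary, 0
        while carry is not None:
            if k == len(self.levels):
                self.levels.append([])
            self.levels[k].append(carry)
            if len(self.levels[k]) > 2:          # overflow: merge the two oldest
                a = self.levels[k].pop(0)
                b = self.levels[k].pop(0)
                carry, k = self.merge_fn(a, b), k + 1
            else:
                carry = None

    def rolling_view(self):
        """Summaries from finest (most recent) to coarsest (oldest)."""
        return [s for level in self.levels for s in reversed(level)]
```

If each batch summary is a dictionary of pattern counts, a suitable `merge_fn` is simply `lambda a, b: {k: a.get(k, 0) + b.get(k, 0) for k in a.keys() | b.keys()}`.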

Relevance:

80.00%

Publisher:

Abstract:

Whilst atom probe tomography (APT) is a powerful technique with the capacity to gather information containing hundreds of millions of atoms from a single specimen, the ability to use this information effectively creates significant challenges. The main technological bottleneck lies in handling the extremely large amounts of data on spatial-chemical correlations, as well as in developing new quantitative computational foundations for image reconstruction that target critical and transformative problems in materials science. The power to explore materials at the atomic scale with the extraordinary sensitivity of detection offered by atom probe tomography has not been fully harnessed due to the challenges of dealing with missing, sparse and often noisy data. Hence there is a profound need to couple the analytical tools for dealing with the data challenges with the experimental issues associated with this instrument. In this paper we provide a summary of some key issues associated with these challenges, and solutions to extract or "mine" fundamental materials science information from the data.

Relevance:

80.00%

Publisher:

Abstract:

With the advanced technology of medical devices and sensors, an abundance of medical data streams is available. However, data analysis techniques are very limited, especially for processing massive multiple physiological streams that may only be understood by medical experts. The state-of-the-art techniques only allow multiple medical devices to independently monitor different physiological parameters of the patient's status; as a result they signal too many false alarms, creating unnecessary noise, especially in the Intensive Care Unit (ICU). An effective solution, which has recently been studied, is to integrate information from multiple physiological parameters to reduce alarms. But it is a challenge to detect abnormalities in frequently changing physiological stream data, since abnormalities develop gradually owing to the complex situation of patients. An analysis of ICU physiological data streams shows that many vital physiological parameters change periodically (such as heart rate, arterial pressure, and respiratory impedance), and thus abnormalities generally manifest as abnormal period patterns. In this paper, we develop a Mining Abnormal Period Patterns from Multiple Physiological Streams (MAPPMPS) method to detect and rank abnormalities in medical sensor streams. The efficiency and effectiveness of the MAPPMPS method are demonstrated on a real-world massive database of multiple physiological streams sampled in the ICU, comprising 250 patients' streams (each involving over 1.3 million data points) with a total size of 28 GB of data.
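MAPPMPS itself is not described in detail above; the sketch below only illustrates the general idea of treating abnormalities as abnormal period patterns: a periodic vital-sign stream is segmented into cycles, each cycle is summarised by a simple feature, and cycles that deviate strongly from the overall baseline are flagged and ranked. The fixed period length, the single mean feature and the z-score rule are all simplifying assumptions for illustration.

```python
import statistics

def abnormal_periods(signal, period_len, z_threshold=3.0):
    """Segment a periodic stream into fixed-length cycles, summarise each
    cycle by its mean, and rank cycles whose mean deviates from the overall
    baseline by more than z_threshold standard deviations.
    (Simplified illustration, not the MAPPMPS method.)"""
    cycles = [signal[i:i + period_len]
              for i in range(0, len(signal) - period_len + 1, period_len)]
    means = [statistics.fmean(c) for c in cycles]
    baseline = statistics.fmean(means)
    spread = statistics.pstdev(means) or 1.0   # avoid division by zero

    flagged = []
    for idx, m in enumerate(means):
        z = abs(m - baseline) / spread
        if z > z_threshold:
            flagged.append((idx, z))           # cycle index and severity score

    return sorted(flagged, key=lambda t: t[1], reverse=True)
```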

Relevance:

70.00%

Publisher:

Abstract:

Knowledge discovery in databases (KDD) is defined as an iterative sequence of steps: data pre-processing, data mining, and post-processing of the mined results. A significant amount of research in data mining has been done, resulting in a variety of algorithms and techniques for each step. However, no single data-mining technique has proven appropriate for every domain and data set. Instead, several techniques may need to be integrated into hybrid systems and used cooperatively during a particular data-mining operation. That is, hybrid solutions are crucial for the success of data mining. This paper presents a hybrid framework for identifying patterns from databases or multi-databases. The framework integrates these techniques for mining tasks from an agent point of view. Based on the experiments conducted, putting different KDD techniques together into the agent-based architecture enables them to be used cooperatively when needed. The proposed framework provides a highly flexible and robust data-mining platform, and the resulting systems demonstrate emergent behaviors, although the framework does not improve the performance of the individual KDD techniques themselves.
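The framework is described only at a high level; as a deliberately simple sketch of the agent-based idea, the code below wraps individual KDD techniques as interchangeable agents and lets a facilitator coordinate them, so that a different technique can be plugged into any step without changing the rest. All class names are hypothetical and the paper's actual architecture is not reproduced.

```python
class Agent:
    """Base class: each KDD technique is wrapped as an agent with a run() step."""
    def run(self, data):
        raise NotImplementedError

class CleaningAgent(Agent):
    """Pre-processing agent: drop records with missing values."""
    def run(self, data):
        return [row for row in data if None not in row]

class FrequentItemAgent(Agent):
    """Mining agent: return items occurring at least min_count times."""
    def __init__(self, min_count):
        self.min_count = min_count
    def run(self, data):
        counts = {}
        for row in data:
            for item in row:
                counts[item] = counts.get(item, 0) + 1
        return {i: c for i, c in counts.items() if c >= self.min_count}

class Facilitator:
    """Coordinates cooperating agents; any agent can be swapped for another
    technique without touching the rest of the pipeline."""
    def __init__(self, agents):
        self.agents = agents
    def run(self, data):
        for agent in self.agents:
            data = agent.run(data)
        return data

# Example: patterns = Facilitator([CleaningAgent(), FrequentItemAgent(2)]).run(transactions)
```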

Relevance:

70.00%

Publisher:

Abstract:

Current approaches to analyzing security protocols using formal methods require users to predefine authentication goals. Moreover, they are unable to discover potential correlations between secure messages. This research attempts to analyze security protocols using data mining. This is done by extending the idea of association rule mining and converting the verification of protocols into computing the frequency and confidence of inconsistent secure messages. It provides a novel and efficient way to analyze security protocols and to find potential correlations between secure messages. The experiments conducted demonstrate the effectiveness of our approach.
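The conversion described above reduces protocol checking to computing support and confidence over collections of secure messages. A minimal sketch of those two standard measures, over hypothetical message sets rather than the paper's actual encoding of protocol runs, is:

```python
def support(transactions, itemset):
    """Fraction of protocol runs (transactions) containing every message in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over the runs."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return joint / base if base > 0 else 0.0

# Runs in which an expected message is missing or inconsistent can then be
# spotted as rules with high antecedent support but low confidence.
runs = [{"msg_A", "msg_B", "nonce_ok"},
        {"msg_A", "msg_B"},
        {"msg_A", "msg_B", "nonce_ok"}]
print(confidence(runs, {"msg_A", "msg_B"}, {"nonce_ok"}))   # ~0.67: possible inconsistency
```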

Relevance:

70.00%

Publisher:

Abstract:

Background
AMP-activated protein kinase (AMPK) has emerged as a significant signaling intermediary that regulates metabolism in response to energy demand and supply. An investigation into the degree of activation and deactivation of AMPK subunits under exercise can provide valuable data for understanding AMPK. In particular, the effect of AMPK on muscle cellular energy status makes this protein a promising pharmacological target for disease treatment. As more AMPK regulation data are accumulated, data mining techniques can play an important role in identifying frequent patterns in the data. Association rule mining, which is commonly used in market basket analysis, can be applied to AMPK regulation.

Results
This paper proposes a framework that can identify potential correlations, either between the states of the isoforms of the α, β and γ subunits of AMPK, or between stimulus factors and the states of the isoforms. Our approach is to apply item constraints in the closed interpretation to the itemset generation, so that a threshold is specified in terms of the number of results rather than as a fixed value for all itemsets of all sizes. The rules derived from the experiments are briefly analyzed. It is found that most of the extracted association rules have biological meaning, and some of them were previously unknown. They indicate directions for further research.

Conclusion
Our findings indicate that AMPK has a great impact on most metabolic actions that are related to energy demand and supply. Those actions are adjusted via its subunit isoforms under specific physical training. Thus, there are strong correlations between AMPK subunit isoforms and exercise. Furthermore, the subunit isoforms are correlated with each other in some cases. The methods developed here could be used to predict these essential relationships and enable an understanding of the functions and metabolic pathways involving AMPK.
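The result-count idea from the Results section above can be illustrated with a small sketch: instead of one fixed minimum support for itemsets of all sizes, candidate itemsets that satisfy an item constraint are ranked by support and only the top k are kept. The brute-force enumeration, the function name and the toy labels are assumptions for illustration; the paper's closed-interpretation constraints are not reproduced.

```python
from itertools import combinations

def top_k_constrained_itemsets(transactions, k, constraint, max_size=3):
    """Enumerate itemsets (up to max_size) that satisfy an item constraint,
    rank them by support, and keep the k best - a count-based threshold
    rather than a fixed minimum support. (Illustrative brute-force sketch.)"""
    items = sorted({i for t in transactions for i in t})
    scored = []
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            if not constraint(itemset):
                continue
            s = sum(1 for t in transactions if set(itemset) <= set(t))
            scored.append((itemset, s))
    scored.sort(key=lambda p: p[1], reverse=True)
    return scored[:k]

# Hypothetical example: keep only itemsets mentioning an alpha-subunit isoform
rows = [{"alpha1_up", "beta2_up", "exercise_high"},
        {"alpha1_up", "gamma3_down", "exercise_high"},
        {"alpha2_up", "beta2_up", "exercise_low"}]
best = top_k_constrained_itemsets(
    rows, k=5, constraint=lambda s: any(i.startswith("alpha") for i in s))
```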

Relevance:

70.00%

Publisher:

Abstract:

Protein kinases, a family of enzymes, have been viewed as important signaling intermediaries used by living organisms to regulate critical biological processes such as memory, hormone response and cell growth. Unbalanced kinase activity is known to cause cancer and other diseases. The increasing efforts to collect, store and disseminate information about the entire kinase family not only provide a valuable data set for understanding cell regulation but also pose a big challenge: extracting valuable knowledge about metabolic pathways from the data. Data mining techniques that have been widely used to find frequent patterns in large datasets can be extended and adapted to kinase data as well. This paper proposes a framework for mining frequent itemsets from the collected kinase dataset. An experiment using AMPK regulation data demonstrates that our approach is useful and efficient in analyzing kinase regulation data.
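For context only, the classical level-wise (Apriori-style) approach to frequent itemset mining that such a framework builds on is sketched below; the paper's own framework and its encoding of kinase regulation data are not reproduced, and the sketch is generic.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Level-wise frequent itemset mining: a (k+1)-itemset can be frequent
    only if every one of its k-subsets is frequent.
    (Generic sketch, not the paper's framework.)"""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: frequent single items
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # Candidate generation by joining frequent k-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
        # Prune any candidate with an infrequent k-subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        result.update(frequent)
        k += 1
    return result
```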

Relevance:

70.00%

Publisher:

Abstract:

Current data mining techniques may not be helpful to some companies and organizations, such as nuclear power plants and earthquake bureaus, which have only small databases. These organizations nevertheless expect to apply data mining techniques to extract useful patterns from their databases to support their decisions. However, the data in these databases, such as the accident database of a nuclear power plant or the earthquake database of an earthquake bureau, may not be large enough to form any patterns. To meet these applications, we present a new mining model in this paper, which is based on collecting knowledge from external sources such as the Web, journals, and newspapers.

Relevance:

70.00%

Publisher:

Abstract:

One common drawback of algorithms for learning linear causal models is that they cannot deal with incomplete data sets. This is unfortunate, since many real problems involve missing data or even hidden variables. In this paper, based on multiple imputation, we propose a three-step process to learn linear causal models from incomplete data sets. Experimental results indicate that this algorithm is better than the single-imputation method (the EM algorithm) and the simple list-deletion method, and that for lower missing rates it can even find models better than those obtained by the greedy learning algorithm MLGS working on the complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
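The three-step process is described only abstractly above; the sketch below illustrates the general multiple-imputation pattern under strong simplifying assumptions: impute the incomplete data m times, fit a model to each completed data set, and pool the fitted parameters. It is not the paper's MLGS-based algorithm, and an ordinary least-squares fit stands in for the causal-structure learning step.

```python
import numpy as np

def multiple_imputation_fit(X, y, m=5, rng=None):
    """Three-step sketch: (1) impute missing entries of X m times by drawing
    around the column means, (2) fit a linear model to each completed data
    set, (3) pool the coefficients by averaging. y is assumed fully observed.
    (Illustration only: least squares stands in for learning a linear causal model.)"""
    rng = rng or np.random.default_rng(0)
    col_mean = np.nanmean(X, axis=0)
    col_std = np.nanstd(X, axis=0)
    coefs = []
    for _ in range(m):
        Xi = X.copy()
        missing = np.isnan(Xi)
        # Step 1: stochastic mean imputation
        noise = rng.normal(col_mean, col_std + 1e-9, size=Xi.shape)
        Xi[missing] = noise[missing]
        # Step 2: fit the model on the completed data (intercept + slopes)
        beta, *_ = np.linalg.lstsq(np.c_[np.ones(len(Xi)), Xi], y, rcond=None)
        coefs.append(beta)
    # Step 3: pool across imputations
    return np.mean(coefs, axis=0)
```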

Relevance:

70.00%

Publisher:

Abstract:

Determining the causal relations among attributes in a domain is a key task in data mining and knowledge discovery. In this paper, we applied a causal discovery algorithm to the business traveler expenditure survey data [1]. A general class of causal models is adopted to discover the causal relationships among continuous and discrete variables. All factors that have a direct effect on the expenditure patterns of travelers could be detected. Our discovery results reinforced some conclusions of the rough set analysis and yielded some new conclusions which might significantly improve the understanding of the expenditure behavior of business travelers.

Relevance:

70.00%

Publisher:

Abstract:

This paper proposes a conceptual matrix model, with algorithms, for biological data processing. The elements required for constructing a matrix model are discussed. Representative matrix-based methods and algorithms with potential in biological data processing are presented or proposed. Several application cases of the model in biological data processing are studied, which show the applicability of the model to various kinds of biological data processing. This conceptual model establishes a framework within which biological data processing and mining can be conducted. The model is also heuristic for other applications.

Relevance:

70.00%

Publisher:

Abstract:

In this paper, we propose a model for discovering frequent sequential patterns (phrases) which can be used as profile descriptors of documents. Numerous phrases can undoubtedly be obtained using data mining algorithms. However, it is difficult to use these phrases effectively to answer what users want. Therefore, we present a pattern taxonomy extraction model which performs the task of extracting descriptive frequent sequential patterns by pruning the meaningless ones. The model is then extended and tested by applying it to an information filtering system. The results of the experiment show that pattern-based methods outperform keyword-based methods. The results also indicate that removal of meaningless patterns not only reduces the cost of computation but also improves the effectiveness of the system.
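A minimal sketch of one pruning idea, under the assumption that a pattern is redundant when it is a sub-sequence of a longer pattern with the same support (i.e., it is not closed), is shown below. The helper names are hypothetical, and the paper's full pattern taxonomy model is not reproduced.

```python
def is_subsequence(shorter, longer):
    """True if `shorter` occurs in `longer` preserving order (not necessarily contiguously)."""
    it = iter(longer)
    return all(term in it for term in shorter)

def prune_non_closed(patterns):
    """Keep only closed sequential patterns: drop any pattern that is a
    sub-sequence of another pattern with the same support.
    `patterns` maps a tuple of terms to its support. (Illustrative sketch.)"""
    kept = {}
    for pat, sup in patterns.items():
        absorbed = any(
            other != pat and sup == osup and is_subsequence(pat, other)
            for other, osup in patterns.items())
        if not absorbed:
            kept[pat] = sup
    return kept

phrases = {("data", "mining"): 4, ("data",): 4, ("mining",): 6}
print(prune_non_closed(phrases))   # ("data",) is absorbed by ("data", "mining")
```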

Relevance:

70.00%

Publisher:

Abstract:

The high-throughput experimental data from new gene microarray technology has spurred numerous efforts to find effective ways of processing microarray data to reveal real biological relationships among genes. This work proposes an innovative data pre-processing approach to identify noise data in the data sets and to eliminate or reduce the impact of that noise on gene clustering. With the proposed algorithm, the pre-processed data sets make the clustering results stable across clustering algorithms with different similarity metrics, the important information of genes and features is kept, and the clustering quality is improved. A preliminary evaluation on real microarray data sets has shown the effectiveness of the proposed algorithm.
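The paper's pre-processing algorithm is not given in the abstract; as a generic stand-in, the sketch below shows one common way of flagging likely noise rows in an expression matrix before clustering, namely removing genes whose profiles are nearly flat across samples. The criterion, threshold and names are assumptions, not the paper's method.

```python
import numpy as np

def filter_noisy_genes(expr, min_variance=0.5):
    """Drop rows (genes) whose variance across samples falls below
    min_variance, on the assumption that near-flat profiles are dominated
    by measurement noise and destabilise clustering.
    expr: numpy array of shape (genes, samples).
    (Generic stand-in, not the paper's pre-processing algorithm.)"""
    variances = np.var(expr, axis=1)
    keep = variances >= min_variance
    return expr[keep], np.flatnonzero(keep)   # filtered matrix and kept row indices
```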

Relevance:

70.00%

Publisher:

Abstract:

One of the fundamental machine learning tasks is predictive classification. Given that organisations collect an ever increasing amount of data, predictive classification methods must be able to handle large amounts of data effectively and efficiently. However, present requirements push existing algorithms to, and sometimes beyond, their limits, since many classification prediction algorithms were designed when currently common data set sizes were beyond imagination. This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions have not been answered. This dissertation investigates two of these key questions.

The first is whether different types of algorithms to those currently employed are required when using large data sets. This is answered by analysing how the bias plus variance decomposition of predictive classification error changes as training set size is increased. Experiments find that larger training sets require different types of algorithms to those currently used. Some insight into the characteristics of suitable algorithms is provided, which may give direction for the development of future classification prediction algorithms designed specifically for use with large data sets.

The second question investigated is the role of sampling in machine learning with large data sets. Sampling has long been used as a means of avoiding the need to scale up algorithms to suit the size of the data set, by scaling down the size of the data set to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity, and execution time. The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of the sub-sampling methodology to reduce the time required to infer a model while sacrificing little if any accuracy. Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These methodologies include using a varying proportion of instances to determine the next inference step, and using a statistical calculation at each inference step to determine a sufficient sample size. Experiments show that using a statistical calculation of sample size can not only substantially reduce execution time but can do so with only a small loss, and occasionally a gain, in accuracy.

One of the common uses of sampling is in the construction of learning curves. Learning curves are often used to attempt to determine the optimal training size that will maximally reduce execution time while not being detrimental to accuracy. An analysis of the performance of methods for detecting convergence of learning curves is performed, with the focus of the analysis on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed. It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt over the applicability of gradient-of-tangent methods for detecting convergence, and over the viability of learning curves for reducing execution time in general.
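To make the gradient-of-tangent idea concrete, the sketch below builds a learning curve by progressive sampling and declares convergence once the slope of the local tangent (estimated over the last few points) falls below a tolerance. The local-plateau caveat discussed above applies directly, since a temporarily flat stretch can trigger the stop early. Function names, the window size and the slope estimate are assumptions for illustration, not the dissertation's own procedure.

```python
import numpy as np

def learning_curve_converged(accuracies, sizes, window=3, tol=1e-4):
    """Estimate the gradient of the tangent to the learning curve from the
    last `window` points (least-squares slope of accuracy vs. training size)
    and report convergence when its magnitude drops below `tol`.
    Susceptible to local accuracy plateaus, as discussed above. (Illustrative sketch.)"""
    if len(accuracies) < window:
        return False
    x = np.asarray(sizes[-window:], dtype=float)
    y = np.asarray(accuracies[-window:], dtype=float)
    slope = np.polyfit(x, y, 1)[0]
    return abs(slope) < tol

# Progressive sampling loop (train_and_score is a hypothetical callback that
# trains on a sample of the given size and returns held-out accuracy):
# sizes, accs = [], []
# for n in (1000, 2000, 4000, 8000, 16000):
#     sizes.append(n); accs.append(train_and_score(n))
#     if learning_curve_converged(accs, sizes):
#         break
```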