130 results for data gathering algorithm


Relevance:

30.00%

Publisher:

Abstract:

For most data stream applications, the volume of data is too large to be stored on permanent devices or to be scanned thoroughly more than once. It is hence recognized that approximate answers are usually sufficient, where a good approximation obtained in a timely manner is often better than an exact answer that is delayed beyond the window of opportunity. Unfortunately, this is not the case for mining frequent patterns over data streams, where algorithms capable of processing data streams online do not conform strictly to a precise error guarantee. Since the quality of approximate answers is as important as their timely delivery, it is necessary to design algorithms that meet both criteria at the same time. In this paper, we propose an algorithm that allows online processing of streaming data while guaranteeing that the support error of frequent patterns stays strictly within a user-specified threshold. Our theoretical and experimental studies show that the algorithm is an effective and reliable method for finding frequent sets in data stream environments when both constraints need to be satisfied.
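The abstract does not spell out the proposed method, so as a hedged illustration of what an error-bounded guarantee over a stream looks like, the sketch below implements the classic Lossy Counting idea for single items (not the paper's frequent-pattern algorithm): every reported count undercounts the true count by at most eps times the stream length N.

```python
from math import ceil

def lossy_count(stream, eps):
    """Approximate item counts: true_count - eps*N <= estimate <= true_count."""
    counts = {}                      # item -> (estimated count, max possible undercount)
    width = ceil(1 / eps)            # bucket width
    bucket = 1                       # current bucket id
    for n, item in enumerate(stream, start=1):
        if item in counts:
            est, err = counts[item]
            counts[item] = (est + 1, err)
        else:
            counts[item] = (1, bucket - 1)
        if n % width == 0:           # at bucket boundaries, prune low-count entries
            counts = {k: (c, e) for k, (c, e) in counts.items() if c + e > bucket}
            bucket += 1
    return counts
```

Reporting every item whose estimate is at least (s - eps) * N then returns all items with true frequency at least s * N, plus possibly some false positives whose support error is still within eps.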

Relevance:

30.00%

Publisher:

Abstract:

The eigenvector associated with the smallest eigenvalue of the autocorrelation matrix of the input signals is called the minor component. Minor component analysis (MCA) is a statistical approach for extracting the minor component from input signals and has been applied in many fields of signal processing and data analysis. In this letter, we propose a neural network learning algorithm for adaptively estimating the minor component from input signals. The dynamics of the proposed algorithm are analyzed via a deterministic discrete time (DDT) method, and sufficient conditions are obtained to guarantee convergence of the proposed algorithm.
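The letter's exact update rule is not given in this abstract; as a generic illustration of the kind of rule such DDT analyses study, the sketch below applies a normalized anti-Hebbian update to a sample autocorrelation matrix, which for a suitably small learning rate drives the weight vector toward the eigenvector of the smallest eigenvalue. Function and parameter names are illustrative, not the paper's.

```python
import numpy as np

def mca_ddt(X, eta=0.01, iters=2000, seed=0):
    """Illustrative minor-component extraction; X is zero-mean, shape (n_samples, n_features)."""
    R = X.T @ X / len(X)                          # sample autocorrelation matrix
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(R.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        w = w - eta * (R @ w - (w @ R @ w) * w)   # deterministic discrete-time update
        w /= np.linalg.norm(w)                    # keep the weight vector on the unit sphere
    return w

# Sanity check: w should align with the eigenvector of the smallest eigenvalue of R,
# e.g. abs(w @ np.linalg.eigh(R)[1][:, 0]) should be close to 1 after convergence.
```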

Relevance:

30.00%

Publisher:

Abstract:

The stability of minor component analysis (MCA) learning algorithms is an important problem in many signal processing applications. In this paper, we propose an effective MCA learning algorithm that offers better stability. The dynamics of the proposed algorithm are analyzed via a corresponding deterministic discrete time (DDT) system. It is proven that if the learning rate satisfies some mild conditions, almost all trajectories of the DDT system starting from points in an invariant set are bounded and converge to the minor component of the autocorrelation matrix of the input data. Simulation results are provided to illustrate the theoretical findings.

Relevance:

30.00%

Publisher:

Abstract:

Minor component analysis (MCA) is an important statistical tool for signal processing and data analysis. Neural networks can be used to extract the minor component from input data online. Compared with traditional algebraic approaches, a neural network method has lower computational complexity. The stability of neural network learning algorithms is crucial to practical applications. In this paper, we propose a stable MCA neural network learning algorithm that has more satisfactory numerical stability than some existing MCA algorithms. The dynamical behaviors of the proposed algorithm are analyzed via a deterministic discrete time (DDT) method, and conditions are obtained to guarantee convergence. Simulations are carried out to illustrate the theoretical results.

Relevance:

30.00%

Publisher:

Abstract:

Motivation: A set of genes and their gene expression levels are used to classify diseased and normal tissues. Due to the massive number of genes in a microarray, there are a large number of edges that divide the different classes in microarray space. The edging genes (EGs) can be co-regulated genes; they can also lie on the same pathway or be deregulated by the same non-coding RNAs, such as siRNAs or miRNAs. Every gene among the EGs is vital for identifying a tissue's class: a change in one EG's expression may cause a tissue to shift from normal to diseased and vice versa. Finding EGs is therefore of biological importance. In this work, we propose an algorithm to find these EGs effectively.

Results: We tested our algorithm on five microarray datasets. The results are compared with the border-based algorithm, which was used to find gene groups and subsequently divide the different classes of tissues. Our algorithm finds a significantly larger number of EGs than the border-based algorithm does. Because our algorithm prunes irrelevant patterns at earlier stages, its time and space requirements are much lower than those of the border-based algorithm.
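Neither the proposed EG-finding algorithm nor the border-based method is detailed in this abstract. Purely to illustrate the stated idea that a change in a single gene's expression can flip a tissue's class, the toy sketch below (with an assumed classifier and perturbation size) flags genes whose single-gene perturbation flips a prediction; it is not the paper's algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def decision_flipping_genes(X, y, delta=2.0):
    """X: (n_samples, n_genes) expression matrix; y: binary labels (0 normal, 1 disease).
    Returns genes whose perturbation by +/- delta standard deviations flips at least
    one sample's predicted class."""
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    base = clf.predict(X)
    flipping = []
    for g in range(X.shape[1]):
        for sign in (1.0, -1.0):
            Xp = X.copy()
            Xp[:, g] += sign * delta * X[:, g].std()   # perturb a single gene
            if np.any(clf.predict(Xp) != base):
                flipping.append(g)
                break
    return flipping
```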

Relevance:

30.00%

Publisher:

Abstract:

Recently, much attention has been given to mass spectrometry (MS) based disease classification, diagnosis, and protein biomarker identification. As with microarray-based investigations, proteomic data generated by such high-throughput experiments often have a high feature-to-sample ratio. Moreover, biological information and patterns are confounded with noise, redundancy, and outliers. Thus, the development of algorithms and procedures for the analysis and interpretation of such data is of paramount importance. In this paper, we propose a hybrid system for analyzing such high-dimensional data. The proposed method uses a k-means clustering based feature extraction and selection procedure to bridge filter selection and wrapper selection methods. The potentially informative mass/charge (m/z) markers selected by filters are subjected to k-means clustering for correlation and redundancy reduction, and a multi-objective Genetic Algorithm selector is then employed to identify discriminative m/z markers from the resulting clusters. Experimental results indicate that the proposed method is suitable for m/z biomarker selection and MS based sample classification.
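The abstract does not fix the filter criterion, the number of clusters, or the GA objectives; the sketch below illustrates the filter-then-cluster stages with assumed choices (an F-test filter, k-means over standardized marker profiles) and, in place of the multi-objective GA, simply keeps the best-scoring marker per cluster.

```python
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def filter_cluster_select(X, y, n_filter=200, n_clusters=20, seed=0):
    """X: (n_samples, n_mz) intensities, y: class labels.
    Returns the index of one representative m/z marker per cluster."""
    F, _ = f_classif(X, y)                           # filter stage: univariate scores
    top = np.argsort(F)[::-1][:n_filter]             # keep the highest-scoring markers
    n_clusters = min(n_clusters, len(top))
    profiles = StandardScaler().fit_transform(X[:, top]).T   # one row per marker
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(profiles)  # group correlated markers
    selected = []
    for c in range(n_clusters):
        members = top[labels == c]
        selected.append(members[np.argmax(F[members])])       # best marker per cluster
    return sorted(selected)
```

A multi-objective GA selector would search over these cluster representatives rather than using the greedy per-cluster pick shown here.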

Relevance:

30.00%

Publisher:

Abstract:

The performance of modified adaptive conjugate gradient (CG) algorithms, based on the iterative CG method for adaptive filtering, is highly related to the way the correlation matrix and the cross-correlation vector are estimated. The existing approaches of implementing CG algorithms using data windows of exponential or sliding form result in either loss of convergence or increased misadjustment. This paper presents and analyzes a new approach to implementing CG algorithms for adaptive filtering using a generalized data windowing scheme. For the new modified CG algorithms, we show that convergence is accelerated and that misadjustment and tracking capability comparable to those of the recursive least squares (RLS) algorithm are achieved. Computer simulations in the framework of a linear system modeling problem demonstrate the improvements offered by the new modifications.
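The generalized data windowing scheme is the paper's contribution and is not reproduced here; for orientation, the sketch below is a conventional exponentially windowed CG adaptive filter that recursively estimates the correlation matrix R and cross-correlation vector p and runs a few conjugate-gradient iterations per sample toward the solution of R w = p.

```python
import numpy as np

def cg_adaptive_filter(x, d, order=8, lam=0.99, cg_steps=2, delta=1e-2):
    """Exponentially windowed CG adaptive filter (illustrative baseline only).
    x: input signal, d: desired signal."""
    M = order
    R = delta * np.eye(M)          # regularized initial autocorrelation estimate
    p = np.zeros(M)                # cross-correlation estimate
    w = np.zeros(M)                # filter weights
    buf = np.zeros(M)
    y = np.zeros(len(x))
    for n in range(len(x)):
        buf = np.r_[x[n], buf[:-1]]            # most recent M input samples
        R = lam * R + np.outer(buf, buf)       # exponential-window updates
        p = lam * p + d[n] * buf
        r = p - R @ w                          # a few CG iterations toward R w = p
        dirn = r.copy()
        for _ in range(cg_steps):
            Rd = R @ dirn
            denom = dirn @ Rd
            if denom <= 0:                     # numerical safeguard
                break
            alpha = (r @ r) / denom
            w = w + alpha * dirn
            r_new = r - alpha * Rd
            beta = (r_new @ r_new) / max(r @ r, 1e-12)
            dirn = r_new + beta * dirn
            r = r_new
        y[n] = w @ buf
    return w, y
```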

Relevance:

30.00%

Publisher:

Abstract:

The high-throughput experimental data produced by gene microarray technology have spurred numerous efforts to find effective ways of processing microarray data to reveal real biological relationships among genes. This work proposes an innovative data pre-processing approach to identify noisy data in the data sets and to eliminate or reduce the impact of that noise on gene clustering. With the proposed algorithm, the pre-processed data sets make the clustering results stable across clustering algorithms with different similarity metrics, the important information of genes and features is preserved, and the clustering quality is improved. A preliminary evaluation on real microarray data sets has shown the effectiveness of the proposed algorithm.

Relevance:

30.00%

Publisher:

Abstract:

One of the fundamental machine learning tasks is predictive classification. Given that organisations collect an ever-increasing amount of data, predictive classification methods must be able to handle large amounts of data effectively and efficiently. However, present requirements push existing algorithms to, and sometimes beyond, their limits, since many classification prediction algorithms were designed when currently common data set sizes were beyond imagination. This has led to a significant amount of research into ways of making classification learning algorithms more effective and efficient. Although substantial progress has been made, a number of key questions have not been answered. This dissertation investigates two of these key questions.

The first is whether different types of algorithms from those currently employed are required when using large data sets. This is answered by analysing how the bias plus variance decomposition of predictive classification error changes as training set size is increased. Experiments find that larger training sets require different types of algorithms from those currently used. Some insight into the characteristics of suitable algorithms is provided, which may give direction for the development of future classification prediction algorithms specifically designed for use with large data sets.

The second question investigated is the role of sampling in machine learning with large data sets. Sampling has long been used to avoid scaling up algorithms to suit the size of the data set by instead scaling down the size of the data set to suit the algorithm. However, the costs of performing sampling have not been widely explored. Two popular sampling methods are compared with learning from all available data in terms of predictive accuracy, model complexity, and execution time. The comparison shows that sub-sampling generally produces models with accuracy close to, and sometimes greater than, that obtainable from learning with all available data. This result suggests that it may be possible to develop algorithms that take advantage of the sub-sampling methodology to reduce the time required to infer a model while sacrificing little, if any, accuracy. Methods of improving effective and efficient learning via sampling are also investigated, and new sampling methodologies are proposed. These methodologies include using a varying proportion of instances to determine the next inference step, and using a statistical calculation at each inference step to determine a sufficient sample size. Experiments show that using a statistical calculation of sample size can substantially reduce execution time with only a small loss, and occasionally a gain, in accuracy.

One of the common uses of sampling is in the construction of learning curves. Learning curves are often used to attempt to determine the optimal training size that will maximally reduce execution time while not being detrimental to accuracy. An analysis of the performance of methods for detecting convergence of learning curves is performed, focusing on methods that calculate the gradient of the tangent to the curve. Given that such methods can be susceptible to local accuracy plateaus, an investigation into the frequency of local plateaus is also performed. It is shown that local accuracy plateaus are a common occurrence, and that ensuring a small loss of accuracy often results in greater computational cost than learning from all available data. These results cast doubt on the applicability of gradient-of-tangent methods for detecting convergence, and on the viability of learning curves for reducing execution time in general.
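As a hedged illustration of the gradient-of-tangent convergence test discussed above (with an assumed window size and slope tolerance, not the dissertation's exact procedure), the sketch below grows a learning curve by subsampling and declares convergence when the locally fitted slope of accuracy against training-set size falls below the tolerance; a local accuracy plateau can trigger this test prematurely, which is exactly the failure mode analysed.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

def learning_curve_convergence(X, y, sizes, slope_tol=1e-6, window=3, seed=0):
    """Return (sizes_used, accuracies, convergence_size or None).
    Declares convergence when the recent slope (accuracy gain per extra
    training instance) drops below slope_tol -- beware local plateaus."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    rng = np.random.default_rng(seed)
    accs = []
    for i, n in enumerate(sizes):
        idx = rng.choice(len(X_tr), size=min(n, len(X_tr)), replace=False)
        clf = DecisionTreeClassifier(random_state=seed).fit(X_tr[idx], y_tr[idx])
        accs.append(accuracy_score(y_te, clf.predict(X_te)))
        if i >= window:
            # slope of the tangent estimated by a least-squares fit over the window
            slope = np.polyfit(sizes[i - window:i + 1], accs[i - window:i + 1], 1)[0]
            if slope < slope_tol:
                return list(sizes[:i + 1]), accs, sizes[i]
    return list(sizes), accs, None
```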

Relevance:

30.00%

Publisher:

Abstract:

The recent emergence of intelligent agent technology and advances in information gathering have been important steps forward in efficiently managing and using the vast amount of information now available on the Web to make informed decisions. There are, however, still many problems to be overcome in the information gathering research arena before the relevant information required by end users can be delivered. Good decisions cannot be made without sufficient, timely, and correct information. Traditionally it is said that knowledge is power; nowadays, sufficient, timely, and correct information is power. Gathering relevant information to meet user information needs is therefore the crucial step in making good decisions. The ideal goal of information gathering is to obtain only the information that users need (no more and no less). However, the volume of available information, the diversity of information formats, uncertainties in the information, and its distributed locations (e.g. the World Wide Web) hinder the process of gathering the right information to meet user needs.

Specifically, two fundamental issues regarding the efficiency of information gathering are mismatch and overload. Mismatch means that some information that meets user needs has not been gathered (it is missed), whereas overload means that some gathered information is not what users need. Traditional information retrieval has developed considerably over the past twenty years, and the introduction of the Web has changed people's perceptions of it. Usually, the task of information retrieval is considered to be leading the user to those documents that are relevant to his or her information needs; the related function of filtering out irrelevant documents is called information filtering. Research into traditional information retrieval has provided many retrieval models and techniques to represent documents and queries. Nowadays, information is becoming highly distributed and increasingly difficult to gather, and user information needs contain many uncertainties. These issues motivate research in agent-based information gathering. In such systems, intelligent agents take commitments from their users and act on the users' behalf to gather the required information. They can retrieve relevant information from highly distributed, uncertain environments because of their intelligence, autonomy, and distribution. Current research on agent-based information gathering systems is divided into single-agent gathering systems and multi-agent gathering systems. In both areas, open problems remain before agent-based information gathering systems can retrieve uncertain information effectively from highly distributed environments.

The aim of this thesis is to develop a theoretical framework for intelligent agents to gather information from the Web. This research integrates the areas of information retrieval and intelligent agents. The specific contributions are an information filtering model for single-agent systems and a dynamic belief model for information fusion in multi-agent systems. The research results are also supported by the construction of real information gathering agents (e.g., Job Agent) for the Internet, which help users gather useful information stored in Web sites. In this framework, information gathering agents are able to describe (or learn) the user's information needs and act like users to retrieve, filter, and/or fuse information. A rough set based information filtering model is developed to address the problem of overload. The new approach allows users to describe their information needs on user concept spaces rather than on document spaces, and it views a user information need as a rough set over the document space. Rough set decision theory is used to classify new documents into three regions: a positive region, a boundary region, and a negative region. Two experiments are presented to verify this model, and they show that the rough set based model provides an efficient approach to the overload problem.

A dynamic belief model for information fusion in multi-agent environments is also developed. This model has polynomial time complexity, and it is proven that the fusion results are belief (mass) functions. Using this model, a collection fusion algorithm for information gathering agents is presented. The difficult case for this research is where collections may be used by more than one agent; the algorithm uses cooperation between agents to provide a solution to this problem in distributed information retrieval systems. This thesis thus presents solutions to theoretical problems in agent-based information gathering systems, including information filtering models, agent belief modelling, and collection fusion. It also presents solutions to some of the technical problems in agent-based information systems, such as document classification, the architecture of agent-based information gathering systems, and decision making in multi-agent environments. Such information gathering agents will gather relevant information from highly distributed, uncertain environments.
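The rough set filtering model itself is not reproduced here; as a simplified illustration of the three-region outcome it produces, the sketch below scores documents against a weighted user concept space (an assumed score and thresholds, not the thesis's decision rules) and routes each document into a positive, boundary, or negative region.

```python
def three_region_filter(docs, concept_weights, alpha=0.6, beta=0.3):
    """docs: {doc_id: set_of_terms}; concept_weights: {term: weight in [0, 1]}.
    Returns positive, boundary, and negative regions based on two thresholds
    (alpha > beta), mimicking a three-way decision."""
    positive, boundary, negative = [], [], []
    total = sum(concept_weights.values()) or 1.0
    for doc_id, terms in docs.items():
        score = sum(concept_weights.get(t, 0.0) for t in terms) / total
        if score >= alpha:
            positive.append(doc_id)        # accept: clearly relevant
        elif score >= beta:
            boundary.append(doc_id)        # defer: needs further evidence
        else:
            negative.append(doc_id)        # reject: clearly irrelevant
    return positive, boundary, negative
```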

Relevance:

30.00%

Publisher:

Abstract:

This paper examines the practical construction of k-Lipschitz triangular norms and conorms from empirical data. We apply a characterization of such functions based on k-convex additive generators and translate the k-convexity of piecewise linear, strictly decreasing functions into a simple set of linear inequalities on their coefficients. This is the basis of a simple linear spline-fitting algorithm, which guarantees the k-Lipschitz property of the resulting triangular norms and conorms.
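For reference, the standard additive-generator construction of a triangular norm and the k-Lipschitz condition that the fitted functions must satisfy are (the paper's k-convexity characterization itself is not restated here):

\[ T(x, y) = g^{(-1)}\bigl(g(x) + g(y)\bigr), \qquad g : [0,1] \to [0,\infty],\ \ g(1) = 0,\ \ g \text{ strictly decreasing}, \]

\[ \lvert T(x_1, y_1) - T(x_2, y_2) \rvert \le k \bigl( \lvert x_1 - x_2 \rvert + \lvert y_1 - y_2 \rvert \bigr) \quad \text{for all } x_1, x_2, y_1, y_2 \in [0,1], \]

where \( g^{(-1)} \) denotes the pseudo-inverse of \( g \), and the dual triangular conorm is obtained as \( S(x, y) = 1 - T(1-x, 1-y) \).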

Relevance:

30.00%

Publisher:

Abstract:

Cluster analysis has played a key role in understanding data streams. The problem is difficult when the clustering task is considered in a sliding window model, in which outdated data must be eliminated properly. We propose the SWEM algorithm, designed based on the Expectation Maximization (EM) technique, to address these challenges. SWEM is equipped with the capability to compute clusters incrementally, using a small number of statistics summarized over the stream, and to adapt to changes in the stream's distribution. The feasibility of SWEM has been verified via a number of experiments, and we show that it is superior to the CluStream algorithm on both synthetic and real datasets.
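SWEM's micro-cluster statistics and window bookkeeping are not given in the abstract; as a hedged sketch of EM driven purely by summarized statistics, the function below performs one EM pass over a batch of a d-dimensional stream, updating a diagonal Gaussian mixture from responsibility-weighted counts, sums, and squared sums. In a sliding-window setting, the statistics of expired batches would additionally be subtracted before the M-step; that part, which SWEM addresses, is omitted here.

```python
import numpy as np

def em_on_batch(batch, means, variances, weights):
    """One EM iteration over a batch using only summarized statistics.
    means, variances: (K, d); weights: (K,); batch: iterable of length-d arrays."""
    K, d = means.shape
    Nk = np.zeros(K)
    Sk = np.zeros((K, d))
    Qk = np.zeros((K, d))
    for x in batch:                                    # E-step: accumulate statistics
        log_p = -0.5 * np.sum((x - means) ** 2 / variances + np.log(variances), axis=1)
        resp = weights * np.exp(log_p - log_p.max())
        resp /= resp.sum()
        Nk += resp
        Sk += resp[:, None] * x
        Qk += resp[:, None] * x ** 2
    weights = Nk / Nk.sum()                            # M-step from the statistics only
    means = Sk / Nk[:, None]
    variances = np.maximum(Qk / Nk[:, None] - means ** 2, 1e-6)
    return means, variances, weights
```

The initial means, variances, and weights could be seeded from the first batch (e.g. by k-means or at random) and then refined batch by batch.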

Relevance:

30.00%

Publisher:

Abstract:

Feature selection is an important technique for application problems with a large number of variables and limited training samples, such as image processing, combinatorial chemistry, and microarray analysis. Commonly employed feature selection strategies can be divided into filter and wrapper methods. In this study, we propose an embedded two-layer feature selection approach that combines the advantages of filter and wrapper algorithms while avoiding their drawbacks. The hybrid algorithm, called GAEF (Genetic Algorithm with embedded filter), divides the feature selection process into two stages. In the first stage, a Genetic Algorithm (GA) is employed to pre-select features, while in the second stage a filter selector is used to further identify a small feature subset for accurate sample classification. Three benchmark microarray datasets are used to evaluate the proposed algorithm. The experimental results suggest that this embedded two-layer feature selection strategy improves both the stability of the selection results and the sample classification accuracy.
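GAEF's encoding, genetic operators, and filter criterion are not specified in the abstract, so the sketch below is a minimal two-stage stand-in with assumed choices: a small binary-mask GA (fitness = cross-validated accuracy of a k-NN classifier) pre-selects features, and an F-test filter then trims the pre-selected set to a compact subset.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import f_classif

def gaef_sketch(X, y, pop=20, gens=15, keep=10, seed=0):
    """Two-stage selection: GA pre-selection, then a filter keeps `keep` features."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    masks = rng.random((pop, n)) < 0.2                   # sparse random initial masks

    def fitness(mask):
        if mask.sum() == 0:
            return 0.0
        clf = KNeighborsClassifier(n_neighbors=3)
        return cross_val_score(clf, X[:, mask], y, cv=3).mean()

    for _ in range(gens):
        scores = np.array([fitness(m) for m in masks])
        order = np.argsort(scores)[::-1]
        parents = masks[order[: pop // 2]]               # truncation selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n) < 0.5, a, b)  # uniform crossover
            flip = rng.random(n) < 1.0 / n               # bit-flip mutation
            children.append(child ^ flip)
        masks = np.vstack([parents, children])

    best = masks[np.argmax([fitness(m) for m in masks])]
    pre = np.flatnonzero(best)                           # stage 1: GA pre-selection
    if pre.size <= keep:
        return pre
    F, _ = f_classif(X[:, pre], y)                       # stage 2: filter refinement
    return pre[np.argsort(F)[::-1][:keep]]
```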

Relevance:

30.00%

Publisher:

Abstract:

Background
Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and are then combined with the minority class to form a balanced dataset.
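The PSO search, the multiple classifiers, and the metric fusion are not reproduced here; as a simplified stand-in for ranking majority samples by merit, the sketch below scores each majority-class sample by its proximity to the minority class (an assumed criterion) and keeps the top-ranked ones to form a balanced set.

```python
import numpy as np

def balance_by_ranking(X, y, minority_label=1):
    """Keep all minority samples and the highest-ranked majority samples
    (ranked here by proximity to the minority class -- an assumed criterion)."""
    minority = X[y == minority_label]
    maj_idx = np.flatnonzero(y != minority_label)
    # merit score: negative mean Euclidean distance to the minority samples
    dists = np.linalg.norm(X[maj_idx][:, None, :] - minority[None, :, :], axis=2)
    merit = -dists.mean(axis=1)
    keep = maj_idx[np.argsort(merit)[::-1][: len(minority)]]
    sel = np.concatenate([np.flatnonzero(y == minority_label), keep])
    return X[sel], y[sel]
```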

Results
One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods under three different metrics. The sampling results also generalize well across different types of classification algorithms, indicating the advantage of the information fusion applied in the hybrid system.

Conclusion
The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly with different datasets, the proposed hybrid system has a better generalization property that alleviates the method-data dependency problem. From the biological perspective, the system provides an indication for further investigation of the highly ranked samples, which may lead to the discovery of new conditions or disease subtypes.

Relevance:

30.00%

Publisher:

Abstract:

Key nodes in a network play a critical role in system recovery and survival. Many traditional key node selection algorithms use characteristics of the physical topology to find key nodes, but they can hardly succeed in mobile ad hoc networks because of node mobility. In this paper we propose a social-aware K-core (S-Kcore) selection algorithm for the Pocket Switched Network. The social view of the network suggests that the social position of mobile nodes can help to find key nodes in the Pocket Switched Network. The S-Kcore selection algorithm is designed to exploit the nodes' social features to improve data communication performance. Experiments using NS2 show that the S-Kcore selection algorithm works in the Pocket Switched Network. Furthermore, with social behavior information, the selected key nodes are better suited to represent and improve the performance of the whole network.
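The abstract does not specify how social features enter the K-core computation; the sketch below shows one plausible reading, using networkx: build a contact graph, drop socially weak ties below an assumed contact-frequency threshold, and take the k-core of what remains as the candidate key nodes.

```python
import networkx as nx

def social_kcore(contacts, k=3, min_freq=2):
    """contacts: iterable of (node_a, node_b, contact_frequency).
    Build a contact graph, drop socially weak ties, and return the k-core nodes."""
    G = nx.Graph()
    for a, b, freq in contacts:
        if freq >= min_freq:          # keep only socially significant ties (assumed rule)
            G.add_edge(a, b, weight=freq)
    core = nx.k_core(G, k=k)          # nodes with degree >= k within the remaining subgraph
    return sorted(core.nodes())

# Example:
# contacts = [("a", "b", 5), ("a", "c", 1), ("b", "c", 4), ("b", "d", 3),
#             ("c", "d", 6), ("a", "d", 2)]
# print(social_kcore(contacts, k=2))   # -> ['a', 'b', 'c', 'd']
```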