130 results for data gathering algorithm





One common drawback of algorithms for learning linear causal models is that they cannot deal with incomplete data sets. This is unfortunate, since many real problems involve missing data or even hidden variables. In this paper, we propose a three-step process, based on multiple imputation, to learn linear causal models from incomplete data sets. Experimental results indicate that this algorithm performs better than the single imputation method (EM algorithm) and the simple listwise deletion method, and that, for lower missing rates, it can even find better models than the greedy learning algorithm MLGS finds on a complete data set. In addition, the method is amenable to parallel or distributed processing, which is an important characteristic for data mining in large data sets.
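The abstract does not spell out its three steps, so the following is only a minimal sketch of generic multiple imputation: complete the data several times, analyze each completed copy, and pool the results. The function name `multiple_imputation_mean`, the random draw from observed values, and the use of a simple mean as the "analysis" are all assumptions made for illustration; they stand in for the causal-model learner and are not the authors' procedure.

```python
import random
import statistics


def multiple_imputation_mean(values, m, rng):
    """Pool the mean of `values` over m imputed copies (None marks a missing entry).

    Illustrative assumption: each missing entry is filled with a random draw
    from the observed entries; a real system would use a proper imputation model.
    """
    observed = [v for v in values if v is not None]
    estimates = []
    for _ in range(m):
        # Step 1: build one completed data set.
        completed = [rng.choice(observed) if v is None else v for v in values]
        # Step 2: analyze the completed data set (here: just its mean).
        estimates.append(statistics.fmean(completed))
    # Step 3: pool the m point estimates by averaging them.
    return statistics.fmean(estimates)


rng = random.Random(0)
pooled = multiple_imputation_mean([1.0, None, 3.0, None, 5.0], m=20, rng=rng)
```

Because every completed copy is analyzed independently, the loop body could be run in parallel, which is consistent with the abstract's remark about distributed processing.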





Most current web-based application systems suffer from poor performance and costly heterogeneous access. Distributed or replicated strategies can alleviate the problem to some degree, but the distributed or replicated model still has problems of its own, such as data synchronization and load balancing. In this paper, we propose a novel architecture for Internet-based data processing systems based on multicast and anycast protocols. The proposed architecture breaks the functionality of an existing data processing system, in particular the database functionality, into several agents. These agents communicate with each other using multicast and anycast mechanisms. We show that the proposed architecture provides better scalability, robustness, automatic load balancing, and performance than the current distributed architecture of Internet-based data





This paper studies the seismic signals of typical vehicle targets in order to extract features and recognize the targets. As a data fusion method, artificial neural networks combined with a genetic algorithm (ANNCGA) are applied to recognize seismic signals that belong to different kinds of vehicle targets. The ANNCGA technique and its architecture are presented. The algorithm was used for classification and recognition of seismic signals of vehicle targets in an outdoor environment. Experiments show that the acquired seismic properties of the targets are correct and that the ANNCGA data fusion method is effective for target recognition.





Data mining refers to extracting or "mining" knowledge from large amounts of data. It is an increasingly popular field that uses statistical, visualization, machine learning, and other data manipulation and knowledge extraction techniques to gain insight into the relationships and patterns hidden in the data. The availability of digital data within picture archiving and communication systems raises the possibility of enhancing health care and research through the computer-based manipulation, processing, and handling of data. That is the basis for the development of computer-assisted radiology. Further development of computer-assisted radiology is associated with the use of new intelligent capabilities, such as multimedia support and data mining, to discover knowledge relevant to diagnosis. It is very useful if the results of data mining can be communicated to humans in an understandable way. In this paper, we present our work on data mining in medical image archiving systems. We investigate the use of a very efficient data mining technique, the decision tree, to learn the knowledge for computer-assisted image analysis. We apply our method to the classification of x-ray images for lung cancer diagnosis. The proposed technique is based on an inductive decision tree learning algorithm that has low complexity with high transparency and accuracy. The results show that the proposed algorithm is robust, accurate, and fast, and that it produces a comprehensible structure summarizing the knowledge it induces.





A mobile robot employed for data collection faces the problem of travelling from an initial location to a final location while staying as close as possible to all the sensors at any given time in the journey. Here we employ optimal control ideas to form the necessary control commands for such a robot, yielding not only the acceleration commands for the underlying robot but also the resulting trajectory. This approach can also be easily extended to produce the optimal trajectory for an aerial vehicle used for data collection from indiscriminately scattered ad hoc sensors located on the ground. We demonstrate the implementation of our algorithm using a Pioneer 3-AT robot.





Sensor networks are emerging as the new frontier in sensing technology; however, there are still issues that need to be addressed. Two such issues are data collection and energy conservation. We consider a mobile robot, or mobile agent, that travels the network collecting information from the sensors before their onboard memory buffers are full. A novel algorithm is presented that adapts a local search algorithm for a special case of the Asymmetric Traveling Salesman Problem with Time-windows (ATSPTW) to solve the dynamic scheduling problem of which nodes are to be visited so that the information collected is not lost. Our algorithms are given and compared to other work.





Determining the causal relations among attributes in a domain is a key task in data mining and knowledge discovery. In this paper, we applied a causal discovery algorithm to the business traveler expenditure survey data [1]. A general class of causal models is adopted in this paper to discover the causal relationships among continuous and discrete variables. All the factors that have a direct effect on the expense pattern of travelers could be detected. Our discovery results reinforced some conclusions of the rough set analysis and yielded some new conclusions that might significantly improve the understanding of the expenditure behavior of business travelers.





One of the key applications of microarray studies is to select and classify gene expression profiles of cancer and normal subjects. In this study, two hybrid approaches, genetic algorithm with decision tree (GADT) and genetic algorithm with neural network (GANN), are used to select optimal gene sets that contribute to the highest classification accuracy. Two benchmark microarray datasets were tested, and the most significant disease-related genes were identified. Furthermore, the selected gene sets achieved high sample classification accuracy (96.79% and 94.92% on the colon cancer dataset, 98.67% and 98.05% on the leukemia dataset), comparable with that obtained by the mRMR algorithm. The results indicate that these two hybrid methods are able to select disease-related genes and improve classification accuracy.





Visualization is one of the most effective methods for analyzing how high-dimensional data are distributed. Dimensionality reduction techniques, such as PCA, can be used to map high-dimensional data to a two- or three-dimensional space. In this paper, we propose an algorithm called HyperMap that can be effectively applied to visualization. Our algorithm can be seen as a generalization of FastMap. It preserves FastMap's linear computational complexity and overcomes several of its main shortcomings, especially in visualization. Since there are more than two pivot objects in each axis of a target space, more distance information needs to be preserved in each dimension. In visualization, the number of pivot objects can therefore exceed the limit of six (2 pivot objects × 3 dimensions). Our HyperMap algorithm also gives more flexibility to the target space, so that the data distribution can be observed from various viewpoints. Its effectiveness is confirmed by empirical evaluations on both real and synthetic datasets.
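For context, the baseline that HyperMap generalizes can be sketched briefly. The code below is plain FastMap with two pivot objects per axis, not HyperMap itself, whose multi-pivot construction is not described in the abstract. The function name `fastmap_axis`, the Euclidean distance, and the sample points are assumptions chosen for illustration.

```python
import math


def fastmap_axis(points, a, b):
    """Project every point onto the line through pivot objects a and b.

    Uses the cosine-law projection of FastMap:
        x_i = (d(a,i)^2 + d(a,b)^2 - d(b,i)^2) / (2 * d(a,b))
    Only distances to the two pivots are needed, so one axis costs O(n).
    """
    d_ab = math.dist(points[a], points[b])
    coords = []
    for p in points:
        d_ai = math.dist(points[a], p)
        d_bi = math.dist(points[b], p)
        coords.append((d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2 * d_ab))
    return coords


# Hypothetical 2-D points; pivots are the first two points.
points = [(0.0, 0.0), (4.0, 0.0), (1.0, 3.0), (2.0, -1.0)]
coords = fastmap_axis(points, a=0, b=1)
```

With two pivots per axis and three axes, this scheme uses at most six pivot objects, which is the limit the abstract refers to.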





Data streams are usually generated in an online fashion characterized by huge volume, rapid unpredictable rates, and fast-changing data characteristics. It has hence been recognized that mining over streaming data requires the problem of limited computational resources to be adequately addressed. Since the arrival rate of data streams can increase significantly and exceed the CPU capacity, the machinery must adapt to this change to guarantee the timeliness of the results. We present an online algorithm that approximates a set of frequent patterns from a sliding window over the underlying data stream, given an a priori CPU capacity. The algorithm automatically detects overload situations and can adaptively shed unprocessed data to guarantee timely results. We theoretically prove, using probabilistic and deterministic techniques, that the error on the output results is bounded within a pre-specified threshold. The empirical results on various datasets also confirm the feasibility of our proposal.





This paper studies a new type of network model formed by the dynamic autonomy area, the structured source servers, and the proxy servers. The new network model accommodates the dynamics within the autonomy area, where each node undertakes different tasks according to its abilities, so that each node carries a load suited to its own capacity. Nodes do not need to exchange information via central servers, so the model supports efficient data transmission and routing search. To handle the highly dynamic nature of the autonomy area, we established dynamic tree structure-proliferation system routing and resource-search algorithms and simulated these algorithms. Test results show that the performance of the proposed network model and algorithms is very stable.





A retrospective assessment of exposure to benzene was carried out for a nested case-control study of lympho-haematopoietic cancers, including leukaemia, in the Australian petroleum industry. Each job or task in the industry was assigned a Base Estimate (BE) of exposure derived from task-based personal exposure assessments carried out by the company occupational hygienists. The BEs corresponded to the estimated arithmetic mean exposure to benzene for each job or task and were used in a deterministic algorithm to estimate the exposure of subjects in the study. Nearly all of the data sets underlying the BEs were found to contain some values below the limit of detection (LOD) of the sampling and analytical methods, and some were very heavily censored; up to 95% of the data were below the LOD in some data sets. It was necessary, therefore, to use a method of calculating the arithmetic mean exposures that took the censored data into account. Three different methods were compared in order to select the most appropriate one for the particular data in the study. A common method is to replace the missing (censored) values with half the detection limit. This method has been recommended for data sets where much of the data are below the limit of detection or where the data are highly skewed, with a geometric standard deviation of 3 or more. Another method, which replaces the censored data with the limit of detection divided by the square root of 2, has been recommended when relatively few data are below the detection limit or where the data are not highly skewed. A third method that was examined is Cohen's method. This involves mathematical extrapolation of the left-hand tail of the distribution, based on the distribution of the uncensored data, and calculation of the maximum likelihood estimate of the arithmetic mean. When these three methods were applied to the data in this study, it was found that the first two simple methods gave similar results in most cases. Cohen's method, on the other hand, gave results that were generally, but not always, higher than the simpler methods and in some cases gave extremely high and even implausible estimates of the mean. It appears that if the data deviate substantially from a simple log-normal distribution, particularly if high outliers are present, then Cohen's method produces erratic and unreliable estimates. After examining these results, and both the distributions and proportions of censored data, it was decided that the half limit of detection method was most suitable in this particular study.
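The two substitution methods described above amount to simple arithmetic, which a short sketch can make concrete. The measurements, the LOD value, and the function name `substituted_mean` below are invented for illustration and are not data from the study; Cohen's maximum likelihood method is not shown.

```python
import math


def substituted_mean(values, lod, divisor):
    """Arithmetic mean with each censored value (None) replaced by lod / divisor."""
    filled = [lod / divisor if v is None else v for v in values]
    return sum(filled) / len(filled)


# Hypothetical measurements; None marks a value below the limit of detection.
samples = [None, None, 0.4, 1.2, None, 0.8]
lod = 0.1

mean_half = substituted_mean(samples, lod, 2.0)             # LOD / 2
mean_root2 = substituted_mean(samples, lod, math.sqrt(2))   # LOD / sqrt(2)
```

Because LOD/√2 is larger than LOD/2, the second method always gives the higher mean of the two, and the gap stays small whenever the detected values dominate the sum, which fits the abstract's observation that the two simple methods gave similar results in most cases.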





Most algorithms that focus on discovering frequent patterns from data streams assume that the machinery is capable of managing all the incoming transactions without any delay and without the need to drop transactions. However, this assumption is often impractical due to the inherent characteristics of data stream environments. Especially under high load conditions, there is often a shortage of system resources to process the incoming transactions. This causes unwanted latencies that, in turn, affect the applicability of the data mining models produced, which often have a small window of opportunity. We propose a load shedding algorithm to address this issue. The algorithm adaptively detects overload situations and drops transactions from data streams using a probabilistic model. We tested our algorithm on both synthetic and real-life datasets to verify its feasibility.
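The abstract does not specify its probabilistic model, so the following is only a minimal sketch of the general idea of load shedding, under the assumption that transactions arrive in batches and that each transaction is kept with probability capacity divided by batch size when the batch exceeds capacity. The function name `shed_load` and the notion of a fixed per-batch `capacity` are hypothetical.

```python
import random


def shed_load(transactions, capacity, rng):
    """Return the transactions to process for one batch.

    Assumed model (not the paper's): if the batch fits within `capacity`,
    keep everything; otherwise keep each transaction independently with
    probability capacity / len(transactions), so the expected number kept
    equals the capacity.
    """
    n = len(transactions)
    if n <= capacity:
        return list(transactions)  # no overload detected
    keep_prob = capacity / n
    return [t for t in transactions if rng.random() < keep_prob]


rng = random.Random(0)
light = shed_load(list(range(50)), capacity=100, rng=rng)    # under capacity
heavy = shed_load(list(range(1000)), capacity=100, rng=rng)  # overloaded
```

Uniform random dropping keeps the retained sample unbiased, so pattern frequencies counted on it can be rescaled by 1 / keep_prob to estimate frequencies in the full stream.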





This paper presents an algorithm based on the Growing Self Organizing Map (GSOM), called the High Dimensional Growing Self Organizing Map with Randomness (HDGSOMr), that can cluster massive high-dimensional data efficiently. The original GSOM algorithm is altered to accommodate the issues related to massive high-dimensional data. These modifications are presented in detail, along with experimental results on a massive real-world dataset.





With phenomenal increases in the generation and storage of digital audio data in several applications, there is a growing need for organizing audio data in databases and providing users with fast access to the desired data. This paper presents a scheme for the content-based query and retrieval of audio data stored in MIDI format. The scheme extracts the melody from the MIDI files and compares it with the melody of the query. The results of retrieval using the proposed algorithm are presented.
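The abstract does not say how melodies are compared, so the following is only one plausible sketch: represent each melody as a list of MIDI note numbers, convert it to pitch intervals so that transposed versions match, and rank database entries by edit distance to the query. The function names, the interval representation, and the sample tunes are assumptions for illustration; melody extraction from actual MIDI files is not shown.

```python
def intervals(notes):
    """Pitch differences between consecutive notes (transposition-invariant)."""
    return [b - a for a, b in zip(notes, notes[1:])]


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def rank_melodies(query, database):
    """Return database keys ordered from closest to farthest melody."""
    q = intervals(query)
    return sorted(database, key=lambda name: edit_distance(q, intervals(database[name])))


# Hypothetical melodies as MIDI note numbers.
database = {
    "tune_a": [60, 62, 64, 65, 67],
    "tune_b": [60, 60, 67, 67, 69, 69, 67],
}
query = [62, 64, 66, 67, 69]  # tune_a transposed up two semitones
ranking = rank_melodies(query, database)
```

Comparing intervals rather than absolute pitches lets a query hummed or played in a different key still retrieve the intended tune.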