788 resultados para data mining applications


Relevância:

100.00% 100.00%

Publicador:

Resumo:

Data mining is a relatively new field of research that its objective is to acquire knowledge from large amounts of data. In medical and health care areas, due to regulations and due to the availability of computers, a large amount of data is becoming available [27]. On the one hand, practitioners are expected to use all this data in their work but, at the same time, such a large amount of data cannot be processed by humans in a short time to make diagnosis, prognosis and treatment schedules. A major objective of this thesis is to evaluate data mining tools in medical and health care applications to develop a tool that can help make rather accurate decisions. In this thesis, the goal is finding a pattern among patients who got pneumonia by clustering of lab data values which have been recorded every day. By this pattern we can generalize it to the patients who did not have been diagnosed by this disease whose lab values shows the same trend as pneumonia patients does. There are 10 tables which have been extracted from a big data base of a hospital in Jena for my work .In ICU (intensive care unit), COPRA system which is a patient management system has been used. All the tables and data stored in German Language database.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The multi-relational Data Mining approach has emerged as alternative to the analysis of structured data, such as relational databases. Unlike traditional algorithms, the multi-relational proposals allow mining directly multiple tables, avoiding the costly join operations. In this paper, is presented a comparative study involving the traditional Patricia Mine algorithm and its corresponding multi-relational proposed, MR-Radix in order to evaluate the performance of two approaches for mining association rules are used for relational databases. This study presents two original contributions: the proposition of an algorithm multi-relational MR-Radix, which is efficient for use in relational databases, both in terms of execution time and in relation to memory usage and the presentation of the empirical approach multirelational advantage in performance over several tables, which avoids the costly join operations from multiple tables. © 2011 IEEE.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Multi-relational data mining enables pattern mining from multiple tables. The existing multi-relational mining association rules algorithms are not able to process large volumes of data, because the amount of memory required exceeds the amount available. The proposed algorithm MRRadix presents a framework that promotes the optimization of memory usage. It also uses the concept of partitioning to handle large volumes of data. The original contribution of this proposal is enable a superior performance when compared to other related algorithms and moreover successfully concludes the task of mining association rules in large databases, bypass the problem of available memory. One of the tests showed that the MR-Radix presents fourteen times less memory usage than the GFP-growth. © 2011 IEEE.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Concept drift, which refers to non stationary learning problems over time, has increasing importance in machine learning and data mining. Many concept drift applications require fast response, which means an algorithm must always be (re)trained with the latest available data. But the process of data labeling is usually expensive and/or time consuming when compared to acquisition of unlabeled data, thus usually only a small fraction of the incoming data may be effectively labeled. Semi-supervised learning methods may help in this scenario, as they use both labeled and unlabeled data in the training process. However, most of them are based on assumptions that the data is static. Therefore, semi-supervised learning with concept drifts is still an open challenging task in machine learning. Recently, a particle competition and cooperation approach has been developed to realize graph-based semi-supervised learning from static data. We have extend that approach to handle data streams and concept drift. The result is a passive algorithm which uses a single classifier approach, naturally adapted to concept changes without any explicit drift detection mechanism. It has built-in mechanisms that provide a natural way of learning from new data, gradually "forgetting" older knowledge as older data items are no longer useful for the classification of newer data items. The proposed algorithm is applied to the KDD Cup 1999 Data of network intrusion, showing its effectiveness.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

In [1], the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter is intended to complement that framework by showing that it is possible to get rid of a user-defined parameter in a way that the clustering stage can be implemented more accurately while having reduced computational complexity

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Biodiversity, a multidimensional property of natural systems, is difficult to quantify partly because of the multitude of indices proposed for this purpose. Indices aim to describe general properties of communities that allow us to compare different regions, taxa, and trophic levels. Therefore, they are of fundamental importance for environmental monitoring and conservation, although there is no consensus about which indices are more appropriate and informative. We tested several common diversity indices in a range of simple to complex statistical analyses in order to determine whether some were better suited for certain analyses than others. We used data collected around the focal plant Plantago lanceolata on 60 temperate grassland plots embedded in an agricultural landscape to explore relationships between the common diversity indices of species richness (S), Shannon's diversity (H'), Simpson's diversity (D-1), Simpson's dominance (D-2), Simpson's evenness (E), and Berger-Parker dominance (BP). We calculated each of these indices for herbaceous plants, arbuscular mycorrhizal fungi, aboveground arthropods, belowground insect larvae, and P.lanceolata molecular and chemical diversity. Including these trait-based measures of diversity allowed us to test whether or not they behaved similarly to the better studied species diversity. We used path analysis to determine whether compound indices detected more relationships between diversities of different organisms and traits than more basic indices. In the path models, more paths were significant when using H', even though all models except that with E were equally reliable. This demonstrates that while common diversity indices may appear interchangeable in simple analyses, when considering complex interactions, the choice of index can profoundly alter the interpretation of results. Data mining in order to identify the index producing the most significant results should be avoided, but simultaneously considering analyses using multiple indices can provide greater insight into the interactions in a system.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The manipulation and handling of an ever increasing volume of data by current data-intensive applications require novel techniques for e?cient data management. Despite recent advances in every aspect of data management (storage, access, querying, analysis, mining), future applications are expected to scale to even higher degrees, not only in terms of volumes of data handled but also in terms of users and resources, often making use of multiple, pre-existing autonomous, distributed or heterogeneous resources.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Hoy en día, con la evolución continua y rápida de las tecnologías de la información y los dispositivos de computación, se recogen y almacenan continuamente grandes volúmenes de datos en distintos dominios y a través de diversas aplicaciones del mundo real. La extracción de conocimiento útil de una cantidad tan enorme de datos no se puede realizar habitualmente de forma manual, y requiere el uso de técnicas adecuadas de aprendizaje automático y de minería de datos. La clasificación es una de las técnicas más importantes que ha sido aplicada con éxito a varias áreas. En general, la clasificación se compone de dos pasos principales: en primer lugar, aprender un modelo de clasificación o clasificador a partir de un conjunto de datos de entrenamiento, y en segundo lugar, clasificar las nuevas instancias de datos utilizando el clasificador aprendido. La clasificación es supervisada cuando todas las etiquetas están presentes en los datos de entrenamiento (es decir, datos completamente etiquetados), semi-supervisada cuando sólo algunas etiquetas son conocidas (es decir, datos parcialmente etiquetados), y no supervisada cuando todas las etiquetas están ausentes en los datos de entrenamiento (es decir, datos no etiquetados). Además, aparte de esta taxonomía, el problema de clasificación se puede categorizar en unidimensional o multidimensional en función del número de variables clase, una o más, respectivamente; o también puede ser categorizado en estacionario o cambiante con el tiempo en función de las características de los datos y de la tasa de cambio subyacente. A lo largo de esta tesis, tratamos el problema de clasificación desde tres perspectivas diferentes, a saber, clasificación supervisada multidimensional estacionaria, clasificación semisupervisada unidimensional cambiante con el tiempo, y clasificación supervisada multidimensional cambiante con el tiempo. Para llevar a cabo esta tarea, hemos usado básicamente los clasificadores Bayesianos como modelos. La primera contribución, dirigiéndose al problema de clasificación supervisada multidimensional estacionaria, se compone de dos nuevos métodos de aprendizaje de clasificadores Bayesianos multidimensionales a partir de datos estacionarios. Los métodos se proponen desde dos puntos de vista diferentes. El primer método, denominado CB-MBC, se basa en una estrategia de envoltura de selección de variables que es voraz y hacia delante, mientras que el segundo, denominado MB-MBC, es una estrategia de filtrado de variables con una aproximación basada en restricciones y en el manto de Markov. Ambos métodos han sido aplicados a dos problemas reales importantes, a saber, la predicción de los inhibidores de la transcriptasa inversa y de la proteasa para el problema de infección por el virus de la inmunodeficiencia humana tipo 1 (HIV-1), y la predicción del European Quality of Life-5 Dimensions (EQ-5D) a partir de los cuestionarios de la enfermedad de Parkinson con 39 ítems (PDQ-39). El estudio experimental incluye comparaciones de CB-MBC y MB-MBC con los métodos del estado del arte de la clasificación multidimensional, así como con métodos comúnmente utilizados para resolver el problema de predicción de la enfermedad de Parkinson, a saber, la regresión logística multinomial, mínimos cuadrados ordinarios, y mínimas desviaciones absolutas censuradas. En ambas aplicaciones, los resultados han sido prometedores con respecto a la precisión de la clasificación, así como en relación al análisis de las estructuras gráficas que identifican interacciones conocidas y novedosas entre las variables. La segunda contribución, referida al problema de clasificación semi-supervisada unidimensional cambiante con el tiempo, consiste en un método nuevo (CPL-DS) para clasificar flujos de datos parcialmente etiquetados. Los flujos de datos difieren de los conjuntos de datos estacionarios en su proceso de generación muy rápido y en su aspecto de cambio de concepto. Es decir, los conceptos aprendidos y/o la distribución subyacente están probablemente cambiando y evolucionando en el tiempo, lo que hace que el modelo de clasificación actual sea obsoleto y deba ser actualizado. CPL-DS utiliza la divergencia de Kullback-Leibler y el método de bootstrapping para cuantificar y detectar tres tipos posibles de cambio: en las predictoras, en la a posteriori de la clase o en ambas. Después, si se detecta cualquier cambio, un nuevo modelo de clasificación se aprende usando el algoritmo EM; si no, el modelo de clasificación actual se mantiene sin modificaciones. CPL-DS es general, ya que puede ser aplicado a varios modelos de clasificación. Usando dos modelos diferentes, el clasificador naive Bayes y la regresión logística, CPL-DS se ha probado con flujos de datos sintéticos y también se ha aplicado al problema real de la detección de código malware, en el cual los nuevos ficheros recibidos deben ser continuamente clasificados en malware o goodware. Los resultados experimentales muestran que nuestro método es efectivo para la detección de diferentes tipos de cambio a partir de los flujos de datos parcialmente etiquetados y también tiene una buena precisión de la clasificación. Finalmente, la tercera contribución, sobre el problema de clasificación supervisada multidimensional cambiante con el tiempo, consiste en dos métodos adaptativos, a saber, Locally Adpative-MB-MBC (LA-MB-MBC) y Globally Adpative-MB-MBC (GA-MB-MBC). Ambos métodos monitorizan el cambio de concepto a lo largo del tiempo utilizando la log-verosimilitud media como métrica y el test de Page-Hinkley. Luego, si se detecta un cambio de concepto, LA-MB-MBC adapta el actual clasificador Bayesiano multidimensional localmente alrededor de cada nodo cambiado, mientras que GA-MB-MBC aprende un nuevo clasificador Bayesiano multidimensional. El estudio experimental realizado usando flujos de datos sintéticos multidimensionales indica los méritos de los métodos adaptativos propuestos. ABSTRACT Nowadays, with the ongoing and rapid evolution of information technology and computing devices, large volumes of data are continuously collected and stored in different domains and through various real-world applications. Extracting useful knowledge from such a huge amount of data usually cannot be performed manually, and requires the use of adequate machine learning and data mining techniques. Classification is one of the most important techniques that has been successfully applied to several areas. Roughly speaking, classification consists of two main steps: first, learn a classification model or classifier from an available training data, and secondly, classify the new incoming unseen data instances using the learned classifier. Classification is supervised when the whole class values are present in the training data (i.e., fully labeled data), semi-supervised when only some class values are known (i.e., partially labeled data), and unsupervised when the whole class values are missing in the training data (i.e., unlabeled data). In addition, besides this taxonomy, the classification problem can be categorized into uni-dimensional or multi-dimensional depending on the number of class variables, one or more, respectively; or can be also categorized into stationary or streaming depending on the characteristics of the data and the rate of change underlying it. Through this thesis, we deal with the classification problem under three different settings, namely, supervised multi-dimensional stationary classification, semi-supervised unidimensional streaming classification, and supervised multi-dimensional streaming classification. To accomplish this task, we basically used Bayesian network classifiers as models. The first contribution, addressing the supervised multi-dimensional stationary classification problem, consists of two new methods for learning multi-dimensional Bayesian network classifiers from stationary data. They are proposed from two different points of view. The first method, named CB-MBC, is based on a wrapper greedy forward selection approach, while the second one, named MB-MBC, is a filter constraint-based approach based on Markov blankets. Both methods are applied to two important real-world problems, namely, the prediction of the human immunodeficiency virus type 1 (HIV-1) reverse transcriptase and protease inhibitors, and the prediction of the European Quality of Life-5 Dimensions (EQ-5D) from 39-item Parkinson’s Disease Questionnaire (PDQ-39). The experimental study includes comparisons of CB-MBC and MB-MBC against state-of-the-art multi-dimensional classification methods, as well as against commonly used methods for solving the Parkinson’s disease prediction problem, namely, multinomial logistic regression, ordinary least squares, and censored least absolute deviations. For both considered case studies, results are promising in terms of classification accuracy as well as regarding the analysis of the learned MBC graphical structures identifying known and novel interactions among variables. The second contribution, addressing the semi-supervised uni-dimensional streaming classification problem, consists of a novel method (CPL-DS) for classifying partially labeled data streams. Data streams differ from the stationary data sets by their highly rapid generation process and their concept-drifting aspect. That is, the learned concepts and/or the underlying distribution are likely changing and evolving over time, which makes the current classification model out-of-date requiring to be updated. CPL-DS uses the Kullback-Leibler divergence and bootstrapping method to quantify and detect three possible kinds of drift: feature, conditional or dual. Then, if any occurs, a new classification model is learned using the expectation-maximization algorithm; otherwise, the current classification model is kept unchanged. CPL-DS is general as it can be applied to several classification models. Using two different models, namely, naive Bayes classifier and logistic regression, CPL-DS is tested with synthetic data streams and applied to the real-world problem of malware detection, where the new received files should be continuously classified into malware or goodware. Experimental results show that our approach is effective for detecting different kinds of drift from partially labeled data streams, as well as having a good classification performance. Finally, the third contribution, addressing the supervised multi-dimensional streaming classification problem, consists of two adaptive methods, namely, Locally Adaptive-MB-MBC (LA-MB-MBC) and Globally Adaptive-MB-MBC (GA-MB-MBC). Both methods monitor the concept drift over time using the average log-likelihood score and the Page-Hinkley test. Then, if a drift is detected, LA-MB-MBC adapts the current multi-dimensional Bayesian network classifier locally around each changed node, whereas GA-MB-MBC learns a new multi-dimensional Bayesian network classifier from scratch. Experimental study carried out using synthetic multi-dimensional data streams shows the merits of both proposed adaptive methods.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Quantile computation has many applications including data mining and financial data analysis. It has been shown that an is an element of-approximate summary can be maintained so that, given a quantile query d (phi, is an element of), the data item at rank [phi N] may be approximately obtained within the rank error precision is an element of N over all N data items in a data stream or in a sliding window. However, scalable online processing of massive continuous quantile queries with different phi and is an element of poses a new challenge because the summary is continuously updated with new arrivals of data items. In this paper, first we aim to dramatically reduce the number of distinct query results by grouping a set of different queries into a cluster so that they can be processed virtually as a single query while the precision requirements from users can be retained. Second, we aim to minimize the total query processing costs. Efficient algorithms are developed to minimize the total number of times for reprocessing clusters and to produce the minimum number of clusters, respectively. The techniques are extended to maintain near-optimal clustering when queries are registered and removed in an arbitrary fashion against whole data streams or sliding windows. In addition to theoretical analysis, our performance study indicates that the proposed techniques are indeed scalable with respect to the number of input queries as well as the number of items and the item arrival rate in a data stream.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Hierarchical visualization systems are desirable because a single two-dimensional visualization plot may not be sufficient to capture all of the interesting aspects of complex high-dimensional data sets. We extend an existing locally linear hierarchical visualization system PhiVis [1] in several directions: bf(1) we allow for em non-linear projection manifolds (the basic building block is the Generative Topographic Mapping -- GTM), bf(2) we introduce a general formulation of hierarchical probabilistic models consisting of local probabilistic models organized in a hierarchical tree, bf(3) we describe folding patterns of low-dimensional projection manifold in high-dimensional data space by computing and visualizing the manifold's local directional curvatures. Quantities such as magnification factors [3] and directional curvatures are helpful for understanding the layout of the nonlinear projection manifold in the data space and for further refinement of the hierarchical visualization plot. Like PhiVis, our system is statistically principled and is built interactively in a top-down fashion using the EM algorithm. We demonstrate the visualization system principle of the approach on a complex 12-dimensional data set and mention possible applications in the pharmaceutical industry.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

The method (algorithm BIDIMS) of multivariate objects display to bidimensional structure in which the sum of differences of objects properties and their nearest neighbors is minimal is being described. The basic regularities on the set of objects at this ordering become evident. Besides, such structures (tables) have high inductive opportunities: many latent properties of objects may be predicted on their coordinates in this table. Opportunities of a method are illustrated on an example of bidimentional ordering of chemical elements. The table received in result practically coincides with the periodic Mendeleev table.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

A novel association rule mining algorithm is composed, using the unit cube chain decomposition structures introduced in [HAN, 1966; TON, 1976]. [HAN, 1966] established the chain split theory. [TON, 1976] invented an excellent chain computation framework which brings chain split into the practical domain. We integrate these technologies around the rule mining procedures. Effectiveness is related to the intention of low complexity of rules mined. Complexity of the procedure composed is complementary to the known Apriori algorithm which is defacto standard in rule mining area.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Ensemble Stream Modeling and Data-cleaning are sensor information processing systems have different training and testing methods by which their goals are cross-validated. This research examines a mechanism, which seeks to extract novel patterns by generating ensembles from data. The main goal of label-less stream processing is to process the sensed events to eliminate the noises that are uncorrelated, and choose the most likely model without over fitting thus obtaining higher model confidence. Higher quality streams can be realized by combining many short streams into an ensemble which has the desired quality. The framework for the investigation is an existing data mining tool. First, to accommodate feature extraction such as a bush or natural forest-fire event we make an assumption of the burnt area (BA*), sensed ground truth as our target variable obtained from logs. Even though this is an obvious model choice the results are disappointing. The reasons for this are two: One, the histogram of fire activity is highly skewed. Two, the measured sensor parameters are highly correlated. Since using non descriptive features does not yield good results, we resort to temporal features. By doing so we carefully eliminate the averaging effects; the resulting histogram is more satisfactory and conceptual knowledge is learned from sensor streams. Second is the process of feature induction by cross-validating attributes with single or multi-target variables to minimize training error. We use F-measure score, which combines precision and accuracy to determine the false alarm rate of fire events. The multi-target data-cleaning trees use information purity of the target leaf-nodes to learn higher order features. A sensitive variance measure such as ƒ-test is performed during each node's split to select the best attribute. Ensemble stream model approach proved to improve when using complicated features with a simpler tree classifier. The ensemble framework for data-cleaning and the enhancements to quantify quality of fitness (30% spatial, 10% temporal, and 90% mobility reduction) of sensor led to the formation of streams for sensor-enabled applications. Which further motivates the novelty of stream quality labeling and its importance in solving vast amounts of real-time mobile streams generated today.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Many systems and applications are continuously producing events. These events are used to record the status of the system and trace the behaviors of the systems. By examining these events, system administrators can check the potential problems of these systems. If the temporal dynamics of the systems are further investigated, the underlying patterns can be discovered. The uncovered knowledge can be leveraged to predict the future system behaviors or to mitigate the potential risks of the systems. Moreover, the system administrators can utilize the temporal patterns to set up event management rules to make the system more intelligent. With the popularity of data mining techniques in recent years, these events grad- ually become more and more useful. Despite the recent advances of the data mining techniques, the application to system event mining is still in a rudimentary stage. Most of works are still focusing on episodes mining or frequent pattern discovering. These methods are unable to provide a brief yet comprehensible summary to reveal the valuable information from the high level perspective. Moreover, these methods provide little actionable knowledge to help the system administrators to better man- age the systems. To better make use of the recorded events, more practical techniques are required. From the perspective of data mining, three correlated directions are considered to be helpful for system management: (1) Provide concise yet comprehensive summaries about the running status of the systems; (2) Make the systems more intelligence and autonomous; (3) Effectively detect the abnormal behaviors of the systems. Due to the richness of the event logs, all these directions can be solved in the data-driven manner. And in this way, the robustness of the systems can be enhanced and the goal of autonomous management can be approached. This dissertation mainly focuses on the foregoing directions that leverage tem- poral mining techniques to facilitate system management. More specifically, three concrete topics will be discussed, including event, resource demand prediction, and streaming anomaly detection. Besides the theoretic contributions, the experimental evaluation will also be presented to demonstrate the effectiveness and efficacy of the corresponding solutions.

Relevância:

100.00% 100.00%

Publicador:

Resumo:

Internet users consume online targeted advertising based on information collected about them and voluntarily share personal information in social networks. Sensor information and data from smart-phones is collected and used by applications, sometimes in unclear ways. As it happens today with smartphones, in the near future sensors will be shipped in all types of connected devices, enabling ubiquitous information gathering from the physical environment, enabling the vision of Ambient Intelligence. The value of gathered data, if not obvious, can be harnessed through data mining techniques and put to use by enabling personalized and tailored services as well as business intelligence practices, fueling the digital economy. However, the ever-expanding information gathering and use undermines the privacy conceptions of the past. Natural social practices of managing privacy in daily relations are overridden by socially-awkward communication tools, service providers struggle with security issues resulting in harmful data leaks, governments use mass surveillance techniques, the incentives of the digital economy threaten consumer privacy, and the advancement of consumergrade data-gathering technology enables new inter-personal abuses. A wide range of fields attempts to address technology-related privacy problems, however they vary immensely in terms of assumptions, scope and approach. Privacy of future use cases is typically handled vertically, instead of building upon previous work that can be re-contextualized, while current privacy problems are typically addressed per type in a more focused way. Because significant effort was required to make sense of the relations and structure of privacy-related work, this thesis attempts to transmit a structured view of it. It is multi-disciplinary - from cryptography to economics, including distributed systems and information theory - and addresses privacy issues of different natures. As existing work is framed and discussed, the contributions to the state-of-theart done in the scope of this thesis are presented. The contributions add to five distinct areas: 1) identity in distributed systems; 2) future context-aware services; 3) event-based context management; 4) low-latency information flow control; 5) high-dimensional dataset anonymity. Finally, having laid out such landscape of the privacy-preserving work, the current and future privacy challenges are discussed, considering not only technical but also socio-economic perspectives.