774 resultados para outlier detection, data mining, gpgpu, gpu computing, supercomputing
Resumo:
In the framework of the European Project for Ice Coring in Antarctica (EPICA), a comprehensive glaciological pre-site survey has been carried out on Amundsenisen, Dronning Maud Land, East Antarctica, in the past decade. Within this survey, four intermediate-depth ice cores and 13 snow pits were analyzed for their ionic composition and interpreted with respect to the spatial and temporal variability of volcanic sulphate deposition. The comparison of the non-sea-salt (nss)-sulphate peaks that are related to the well-known eruptions of Pinatubo and Cerro Hudson in AD 1991 revealed sulphate depositions of comparable size (15.8 ± 3.4 kg/km**2) in 11 snow pits. There is a tendency to higher annual concentrations for smaller snow-accumulation rates. The combination of seasonal sodium and annually resolved nss-sulphate records allowed the establishment of a time-scale derived by annual-layer counting over the last 2000 years and thus a detailed chronology of annual volcanic sulphate deposition. Using a robust outlier detection algorithm, 49 volcanic eruptions were identified between AD 165 and 1997. The dating uncertainty is ±3 years between AD 1997 and 1601, around ±5 years between AD 1601 and 1257, and increasing to ±24 years at AD 165, improving the accuracy of the volcanic chronology during the penultimate millennium considerably.
Resumo:
The Microarray technique is rather powerful, as it allows to test up thousands of genes at a time, but this produces an overwhelming set of data files containing huge amounts of data, which is quite difficult to pre-process, separate, classify and correlate for interesting conclusions to be extracted. Modern machine learning, data mining and clustering techniques based on information theory, are needed to read and interpret the information contents buried in those large data sets. Independent Component Analysis method can be used to correct the data affected by corruption processes or to filter the uncorrectable one and then clustering methods can group similar genes or classify samples. In this paper a hybrid approach is used to obtain a two way unsupervised clustering for a corrected microarray data.
Resumo:
Sensor networks are increasingly becoming one of the main sources of Big Data on the Web. However, the observations that they produce are made available with heterogeneous schemas, vocabularies and data formats, making it difficult to share and reuse these data for other purposes than those for which they were originally set up. In this thesis we address these challenges, considering how we can transform streaming raw data to rich ontology-based information that is accessible through continuous queries for streaming data. Our main contribution is an ontology-based approach for providing data access and query capabilities to streaming data sources, allowing users to express their needs at a conceptual level, independent of implementation and language-specific details. We introduce novel query rewriting and data translation techniques that rely on mapping definitions relating streaming data models to ontological concepts. Specific contributions include: • The syntax and semantics of the SPARQLStream query language for ontologybased data access, and a query rewriting approach for transforming SPARQLStream queries into streaming algebra expressions. • The design of an ontology-based streaming data access engine that can internally reuse an existing data stream engine, complex event processor or sensor middleware, using R2RML mappings for defining relationships between streaming data models and ontology concepts. Concerning the sensor metadata of such streaming data sources, we have investigated how we can use raw measurements to characterize streaming data, producing enriched data descriptions in terms of ontological models. Our specific contributions are: • A representation of sensor data time series that captures gradient information that is useful to characterize types of sensor data. • A method for classifying sensor data time series and determining the type of data, using data mining techniques, and a method for extracting semantic sensor metadata features from the time series.
Resumo:
In the last decade, complex networks have widely been applied to the study of many natural and man-made systems, and to the extraction of meaningful information from the interaction structures created by genes and proteins. Nevertheless, less attention has been devoted to metabonomics, due to the lack of a natural network representation of spectral data. Here we define a technique for reconstructing networks from spectral data sets, where nodes represent spectral bins, and pairs of them are connected when their intensities follow a pattern associated with a disease. The structural analysis of the resulting network can then be used to feed standard data-mining algorithms, for instance for the classification of new (unlabeled) subjects. Furthermore, we show how the structure of the network is resilient to the presence of external additive noise, and how it can be used to extract relevant knowledge about the development of the disease.
Resumo:
Nanotechnology represents an area of particular promise and significant opportunity across multiple scientific disciplines. Ongoing nanotechnology research ranges from the characterization of nanoparticles and nanomaterials to the analysis and processing of experimental data seeking correlations between nanoparticles and their functionalities and side effects. Due to their special properties, nanoparticles are suitable for cellular-level diagnostics and therapy, offering numerous applications in medicine, e.g. development of biomedical devices, tissue repair, drug delivery systems and biosensors. In nanomedicine, recent studies are producing large amounts of structural and property data, highlighting the role for computational approaches in information management. While in vitro and in vivo assays are expensive, the cost of computing is falling. Furthermore, improvements in the accuracy of computational methods (e.g. data mining, knowledge discovery, modeling and simulation) have enabled effective tools to automate the extraction, management and storage of these vast data volumes. Since this information is widely distributed, one major issue is how to locate and access data where it resides (which also poses data-sharing limitations). The novel discipline of nanoinformatics addresses the information challenges related to nanotechnology research. In this paper, we summarize the needs and challenges in the field and present an overview of extant initiatives and efforts.
Resumo:
Los avances en el hardware permiten disponer de grandes volúmenes de datos, surgiendo aplicaciones que deben suministrar información en tiempo cuasi-real, la monitorización de pacientes, ej., el seguimiento sanitario de las conducciones de agua, etc. Las necesidades de estas aplicaciones hacen emerger el modelo de flujo de datos (data streaming) frente al modelo almacenar-para-despuésprocesar (store-then-process). Mientras que en el modelo store-then-process, los datos son almacenados para ser posteriormente consultados; en los sistemas de streaming, los datos son procesados a su llegada al sistema, produciendo respuestas continuas sin llegar a almacenarse. Esta nueva visión impone desafíos para el procesamiento de datos al vuelo: 1) las respuestas deben producirse de manera continua cada vez que nuevos datos llegan al sistema; 2) los datos son accedidos solo una vez y, generalmente, no son almacenados en su totalidad; y 3) el tiempo de procesamiento por dato para producir una respuesta debe ser bajo. Aunque existen dos modelos para el cómputo de respuestas continuas, el modelo evolutivo y el de ventana deslizante; éste segundo se ajusta mejor en ciertas aplicaciones al considerar únicamente los datos recibidos más recientemente, en lugar de todo el histórico de datos. En los últimos años, la minería de datos en streaming se ha centrado en el modelo evolutivo. Mientras que, en el modelo de ventana deslizante, el trabajo presentado es más reducido ya que estos algoritmos no sólo deben de ser incrementales si no que deben borrar la información que caduca por el deslizamiento de la ventana manteniendo los anteriores tres desafíos. Una de las tareas fundamentales en minería de datos es la búsqueda de agrupaciones donde, dado un conjunto de datos, el objetivo es encontrar grupos representativos, de manera que se tenga una descripción sintética del conjunto. Estas agrupaciones son fundamentales en aplicaciones como la detección de intrusos en la red o la segmentación de clientes en el marketing y la publicidad. Debido a las cantidades masivas de datos que deben procesarse en este tipo de aplicaciones (millones de eventos por segundo), las soluciones centralizadas puede ser incapaz de hacer frente a las restricciones de tiempo de procesamiento, por lo que deben recurrir a descartar datos durante los picos de carga. Para evitar esta perdida de datos, se impone el procesamiento distribuido de streams, en concreto, los algoritmos de agrupamiento deben ser adaptados para este tipo de entornos, en los que los datos están distribuidos. En streaming, la investigación no solo se centra en el diseño para tareas generales, como la agrupación, sino también en la búsqueda de nuevos enfoques que se adapten mejor a escenarios particulares. Como ejemplo, un mecanismo de agrupación ad-hoc resulta ser más adecuado para la defensa contra la denegación de servicio distribuida (Distributed Denial of Services, DDoS) que el problema tradicional de k-medias. En esta tesis se pretende contribuir en el problema agrupamiento en streaming tanto en entornos centralizados y distribuidos. Hemos diseñado un algoritmo centralizado de clustering mostrando las capacidades para descubrir agrupaciones de alta calidad en bajo tiempo frente a otras soluciones del estado del arte, en una amplia evaluación. Además, se ha trabajado sobre una estructura que reduce notablemente el espacio de memoria necesario, controlando, en todo momento, el error de los cómputos. Nuestro trabajo también proporciona dos protocolos de distribución del cómputo de agrupaciones. Se han analizado dos características fundamentales: el impacto sobre la calidad del clustering al realizar el cómputo distribuido y las condiciones necesarias para la reducción del tiempo de procesamiento frente a la solución centralizada. Finalmente, hemos desarrollado un entorno para la detección de ataques DDoS basado en agrupaciones. En este último caso, se ha caracterizado el tipo de ataques detectados y se ha desarrollado una evaluación sobre la eficiencia y eficacia de la mitigación del impacto del ataque. ABSTRACT Advances in hardware allow to collect huge volumes of data emerging applications that must provide information in near-real time, e.g., patient monitoring, health monitoring of water pipes, etc. The data streaming model emerges to comply with these applications overcoming the traditional store-then-process model. With the store-then-process model, data is stored before being consulted; while, in streaming, data are processed on the fly producing continuous responses. The challenges of streaming for processing data on the fly are the following: 1) responses must be produced continuously whenever new data arrives in the system; 2) data is accessed only once and is generally not maintained in its entirety, and 3) data processing time to produce a response should be low. Two models exist to compute continuous responses: the evolving model and the sliding window model; the latter fits best with applications must be computed over the most recently data rather than all the previous data. In recent years, research in the context of data stream mining has focused mainly on the evolving model. In the sliding window model, the work presented is smaller since these algorithms must be incremental and they must delete the information which expires when the window slides. Clustering is one of the fundamental techniques of data mining and is used to analyze data sets in order to find representative groups that provide a concise description of the data being processed. Clustering is critical in applications such as network intrusion detection or customer segmentation in marketing and advertising. Due to the huge amount of data that must be processed by such applications (up to millions of events per second), centralized solutions are usually unable to cope with timing restrictions and recur to shedding techniques where data is discarded during load peaks. To avoid discarding of data, processing of streams (such as clustering) must be distributed and adapted to environments where information is distributed. In streaming, research does not only focus on designing for general tasks, such as clustering, but also in finding new approaches that fit bests with particular scenarios. As an example, an ad-hoc grouping mechanism turns out to be more adequate than k-means for defense against Distributed Denial of Service (DDoS). This thesis contributes to the data stream mining clustering technique both for centralized and distributed environments. We present a centralized clustering algorithm showing capabilities to discover clusters of high quality in low time and we provide a comparison with existing state of the art solutions. We have worked on a data structure that significantly reduces memory requirements while controlling the error of the clusters statistics. We also provide two distributed clustering protocols. We focus on the analysis of two key features: the impact on the clustering quality when computation is distributed and the requirements for reducing the processing time compared to the centralized solution. Finally, with respect to ad-hoc grouping techniques, we have developed a DDoS detection framework based on clustering.We have characterized the attacks detected and we have evaluated the efficiency and effectiveness of mitigating the attack impact.
Resumo:
The use of data mining techniques for the gene profile discovery of diseases, such as cancer, is becoming usual in many researches. These techniques do not usually analyze the relationships between genes in depth, depending on the different variety of manifestations of the disease (related to patients). This kind of analysis takes a considerable amount of time and is not always the focus of the research. However, it is crucial in order to generate personalized treatments to fight the disease. Thus, this research focuses on finding a mechanism for gene profile analysis to be used by the medical and biologist experts. Results: In this research, the MedVir framework is proposed. It is an intuitive mechanism based on the visualization of medical data such as gene profiles, patients, clinical data, etc. MedVir, which is based on an Evolutionary Optimization technique, is a Dimensionality Reduction (DR) approach that presents the data in a three dimensional space. Furthermore, thanks to Virtual Reality technology, MedVir allows the expert to interact with the data in order to tailor it to the experience and knowledge of the expert.
Resumo:
La minería de datos es un campo de las ciencias de la computación referido al proceso que intenta descubrir patrones en grandes volúmenes de datos. La minería de datos busca generar información similar a la que podría producir un experto humano. Además es el proceso de descubrir conocimientos interesantes, como patrones, asociaciones, cambios, anomalías y estructuras significativas a partir de grandes cantidades de datos almacenadas en bases de datos, data warehouses o cualquier otro medio de almacenamiento de información. El aprendizaje automático o aprendizaje de máquinas es una rama de la Inteligencia artificial cuyo objetivo es desarrollar técnicas que permitan a las computadoras aprender. De forma más concreta, se trata de crear programas capaces de generalizar comportamientos a partir de una información no estructurada suministrada en forma de ejemplos. La minería de datos utiliza métodos de aprendizaje automático para descubrir y enumerar patrones presentes en los datos. En los últimos años se han aplicado las técnicas de clasificación y aprendizaje automático en un número elevado de ámbitos como el sanitario, comercial o de seguridad. Un ejemplo muy actual es la detección de comportamientos y transacciones fraudulentas en bancos. Una aplicación de interés es el uso de las técnicas desarrolladas para la detección de comportamientos fraudulentos en la identificación de usuarios existentes en el interior de entornos inteligentes sin necesidad de realizar un proceso de autenticación. Para comprobar que estas técnicas son efectivas durante la fase de análisis de una determinada solución, es necesario crear una plataforma que de soporte al desarrollo, validación y evaluación de algoritmos de aprendizaje y clasificación en los entornos de aplicación bajo estudio. El proyecto planteado está definido para la creación de una plataforma que permita evaluar algoritmos de aprendizaje automático como mecanismos de identificación en espacios inteligentes. Se estudiarán tanto los algoritmos propios de este tipo de técnicas como las plataformas actuales existentes para definir un conjunto de requisitos específicos de la plataforma a desarrollar. Tras el análisis se desarrollará parcialmente la plataforma. Tras el desarrollo se validará con pruebas de concepto y finalmente se verificará en un entorno de investigación a definir. ABSTRACT. The data mining is a field of the sciences of the computation referred to the process that it tries to discover patterns in big volumes of information. The data mining seeks to generate information similar to the one that a human expert might produce. In addition it is the process of discovering interesting knowledge, as patterns, associations, changes, abnormalities and significant structures from big quantities of information stored in databases, data warehouses or any other way of storage of information. The machine learning is a branch of the artificial Intelligence which aim is to develop technologies that they allow the computers to learn. More specifically, it is a question of creating programs capable of generalizing behaviors from not structured information supplied in the form of examples. The data mining uses methods of machine learning to discover and to enumerate present patterns in the information. In the last years there have been applied classification and machine learning techniques in a high number of areas such as healthcare, commercial or security. A very current example is the detection of behaviors and fraudulent transactions in banks. An application of interest is the use of the techniques developed for the detection of fraudulent behaviors in the identification of existing Users inside intelligent environments without need to realize a process of authentication. To verify these techniques are effective during the phase of analysis of a certain solution, it is necessary to create a platform that support the development, validation and evaluation of algorithms of learning and classification in the environments of application under study. The project proposed is defined for the creation of a platform that allows evaluating algorithms of machine learning as mechanisms of identification in intelligent spaces. There will be studied both the own algorithms of this type of technologies and the current existing platforms to define a set of specific requirements of the platform to develop. After the analysis the platform will develop partially. After the development it will be validated by prove of concept and finally verified in an environment of investigation that would be define.
Resumo:
Clinicians could model the brain injury of a patient through his brain activity. However, how this model is defined and how it changes when the patient is recovering are questions yet unanswered. In this paper, the use of MedVir framework is proposed with the aim of answering these questions. Based on complex data mining techniques, this provides not only the differentiation between TBI patients and control subjects (with a 72% of accuracy using 0.632 Bootstrap validation), but also the ability to detect whether a patient may recover or not, and all of that in a quick and easy way through a visualization technique which allows interaction.
Resumo:
The mobile apps market is a tremendous success, with millions of apps downloaded and used every day by users spread all around the world. For apps’ developers, having their apps published on one of the major app stores (e.g. Google Play market) is just the beginning of the apps lifecycle. Indeed, in order to successfully compete with the other apps in the market, an app has to be updated frequently by adding new attractive features and by fixing existing bugs. Clearly, any developer interested in increasing the success of her app should try to implement features desired by the app’s users and to fix bugs affecting the user experience of many of them. A precious source of information to decide how to collect users’ opinions and wishes is represented by the reviews left by users on the store from which they downloaded the app. However, to exploit such information the app’s developer should manually read each user review and verify if it contains useful information (e.g. suggestions for new features). This is something not doable if the app receives hundreds of reviews per day, as happens for the very popular apps on the market. In this work, our aim is to provide support to mobile apps developers by proposing a novel approach exploiting data mining, natural language processing, machine learning, and clustering techniques in order to classify the user reviews on the basis of the information they contain (e.g. useless, suggestion for new features, bugs reporting). Such an approach has been empirically evaluated and made available in a web-‐based tool publicly available to all apps’ developers. The achieved results showed that the developed tool: (i) is able to correctly categorise user reviews on the basis of their content (e.g. isolating those reporting bugs) with 78% of accuracy, (ii) produces clusters of reviews (e.g. groups together reviews indicating exactly the same bug to be fixed) that are meaningful from a developer’s point-‐of-‐view, and (iii) is considered useful by a software company working in the mobile apps’ development market.
Resumo:
Los Centros de Datos se encuentran actualmente en cualquier sector de la economía mundial. Están compuestos por miles de servidores, dando servicio a los usuarios de forma global, las 24 horas del día y los 365 días del año. Durante los últimos años, las aplicaciones del ámbito de la e-Ciencia, como la e-Salud o las Ciudades Inteligentes han experimentado un desarrollo muy significativo. La necesidad de manejar de forma eficiente las necesidades de cómputo de aplicaciones de nueva generación, junto con la creciente demanda de recursos en aplicaciones tradicionales, han facilitado el rápido crecimiento y la proliferación de los Centros de Datos. El principal inconveniente de este aumento de capacidad ha sido el rápido y dramático incremento del consumo energético de estas infraestructuras. En 2010, la factura eléctrica de los Centros de Datos representaba el 1.3% del consumo eléctrico mundial. Sólo en el año 2012, el consumo de potencia de los Centros de Datos creció un 63%, alcanzando los 38GW. En 2013 se estimó un crecimiento de otro 17%, hasta llegar a los 43GW. Además, los Centros de Datos son responsables de más del 2% del total de emisiones de dióxido de carbono a la atmósfera. Esta tesis doctoral se enfrenta al problema energético proponiendo técnicas proactivas y reactivas conscientes de la temperatura y de la energía, que contribuyen a tener Centros de Datos más eficientes. Este trabajo desarrolla modelos de energía y utiliza el conocimiento sobre la demanda energética de la carga de trabajo a ejecutar y de los recursos de computación y refrigeración del Centro de Datos para optimizar el consumo. Además, los Centros de Datos son considerados como un elemento crucial dentro del marco de la aplicación ejecutada, optimizando no sólo el consumo del Centro de Datos sino el consumo energético global de la aplicación. Los principales componentes del consumo en los Centros de Datos son la potencia de computación utilizada por los equipos de IT, y la refrigeración necesaria para mantener los servidores dentro de un rango de temperatura de trabajo que asegure su correcto funcionamiento. Debido a la relación cúbica entre la velocidad de los ventiladores y el consumo de los mismos, las soluciones basadas en el sobre-aprovisionamiento de aire frío al servidor generalmente tienen como resultado ineficiencias energéticas. Por otro lado, temperaturas más elevadas en el procesador llevan a un consumo de fugas mayor, debido a la relación exponencial del consumo de fugas con la temperatura. Además, las características de la carga de trabajo y las políticas de asignación de recursos tienen un impacto importante en los balances entre corriente de fugas y consumo de refrigeración. La primera gran contribución de este trabajo es el desarrollo de modelos de potencia y temperatura que permiten describes estos balances entre corriente de fugas y refrigeración; así como la propuesta de estrategias para minimizar el consumo del servidor por medio de la asignación conjunta de refrigeración y carga desde una perspectiva multivariable. Cuando escalamos a nivel del Centro de Datos, observamos un comportamiento similar en términos del balance entre corrientes de fugas y refrigeración. Conforme aumenta la temperatura de la sala, mejora la eficiencia de la refrigeración. Sin embargo, este incremente de la temperatura de sala provoca un aumento en la temperatura de la CPU y, por tanto, también del consumo de fugas. Además, la dinámica de la sala tiene un comportamiento muy desigual, no equilibrado, debido a la asignación de carga y a la heterogeneidad en el equipamiento de IT. La segunda contribución de esta tesis es la propuesta de técnicas de asigación conscientes de la temperatura y heterogeneidad que permiten optimizar conjuntamente la asignación de tareas y refrigeración a los servidores. Estas estrategias necesitan estar respaldadas por modelos flexibles, que puedan trabajar en tiempo real, para describir el sistema desde un nivel de abstracción alto. Dentro del ámbito de las aplicaciones de nueva generación, las decisiones tomadas en el nivel de aplicación pueden tener un impacto dramático en el consumo energético de niveles de abstracción menores, como por ejemplo, en el Centro de Datos. Es importante considerar las relaciones entre todos los agentes computacionales implicados en el problema, de forma que puedan cooperar para conseguir el objetivo común de reducir el coste energético global del sistema. La tercera contribución de esta tesis es el desarrollo de optimizaciones energéticas para la aplicación global por medio de la evaluación de los costes de ejecutar parte del procesado necesario en otros niveles de abstracción, que van desde los nodos hasta el Centro de Datos, por medio de técnicas de balanceo de carga. Como resumen, el trabajo presentado en esta tesis lleva a cabo contribuciones en el modelado y optimización consciente del consumo por fugas y la refrigeración de servidores; el modelado de los Centros de Datos y el desarrollo de políticas de asignación conscientes de la heterogeneidad; y desarrolla mecanismos para la optimización energética de aplicaciones de nueva generación desde varios niveles de abstracción. ABSTRACT Data centers are easily found in every sector of the worldwide economy. They consist of tens of thousands of servers, serving millions of users globally and 24-7. In the last years, e-Science applications such e-Health or Smart Cities have experienced a significant development. The need to deal efficiently with the computational needs of next-generation applications together with the increasing demand for higher resources in traditional applications has facilitated the rapid proliferation and growing of data centers. A drawback to this capacity growth has been the rapid increase of the energy consumption of these facilities. In 2010, data center electricity represented 1.3% of all the electricity use in the world. In year 2012 alone, global data center power demand grew 63% to 38GW. A further rise of 17% to 43GW was estimated in 2013. Moreover, data centers are responsible for more than 2% of total carbon dioxide emissions. This PhD Thesis addresses the energy challenge by proposing proactive and reactive thermal and energy-aware optimization techniques that contribute to place data centers on a more scalable curve. This work develops energy models and uses the knowledge about the energy demand of the workload to be executed and the computational and cooling resources available at data center to optimize energy consumption. Moreover, data centers are considered as a crucial element within their application framework, optimizing not only the energy consumption of the facility, but the global energy consumption of the application. The main contributors to the energy consumption in a data center are the computing power drawn by IT equipment and the cooling power needed to keep the servers within a certain temperature range that ensures safe operation. Because of the cubic relation of fan power with fan speed, solutions based on over-provisioning cold air into the server usually lead to inefficiencies. On the other hand, higher chip temperatures lead to higher leakage power because of the exponential dependence of leakage on temperature. Moreover, workload characteristics as well as allocation policies also have an important impact on the leakage-cooling tradeoffs. The first key contribution of this work is the development of power and temperature models that accurately describe the leakage-cooling tradeoffs at the server level, and the proposal of strategies to minimize server energy via joint cooling and workload management from a multivariate perspective. When scaling to the data center level, a similar behavior in terms of leakage-temperature tradeoffs can be observed. As room temperature raises, the efficiency of data room cooling units improves. However, as we increase room temperature, CPU temperature raises and so does leakage power. Moreover, the thermal dynamics of a data room exhibit unbalanced patterns due to both the workload allocation and the heterogeneity of computing equipment. The second main contribution is the proposal of thermal- and heterogeneity-aware workload management techniques that jointly optimize the allocation of computation and cooling to servers. These strategies need to be backed up by flexible room level models, able to work on runtime, that describe the system from a high level perspective. Within the framework of next-generation applications, decisions taken at this scope can have a dramatical impact on the energy consumption of lower abstraction levels, i.e. the data center facility. It is important to consider the relationships between all the computational agents involved in the problem, so that they can cooperate to achieve the common goal of reducing energy in the overall system. The third main contribution is the energy optimization of the overall application by evaluating the energy costs of performing part of the processing in any of the different abstraction layers, from the node to the data center, via workload management and off-loading techniques. In summary, the work presented in this PhD Thesis, makes contributions on leakage and cooling aware server modeling and optimization, data center thermal modeling and heterogeneityaware data center resource allocation, and develops mechanisms for the energy optimization for next-generation applications from a multi-layer perspective.
Resumo:
Stream-mining approach is defined as a set of cutting-edge techniques designed to process streams of data in real time, in order to extract knowledge. In the particular case of classification, stream-mining has to adapt its behaviour to the volatile underlying data distributions, what has been called concept drift. Moreover, it is important to note that concept drift may lead to situations where predictive models become invalid and have therefore to be updated to represent the actual concepts that data poses. In this context, there is a specific type of concept drift, known as recurrent concept drift, where the concepts represented by data have already appeared in the past. In those cases the learning process could be saved or at least minimized by applying a previously trained model. This could be extremely useful in ubiquitous environments that are characterized by the existence of resource constrained devices. To deal with the aforementioned scenario, meta-models can be used in the process of enhancing the drift detection mechanisms used by data stream algorithms, by representing and predicting when the change will occur. There are some real-world situations where a concept reappears, as in the case of intrusion detection systems (IDS), where the same incidents or an adaptation of them usually reappear over time. In these environments the early prediction of drift by means of a better knowledge of past models can help to anticipate to the change, thus improving efficiency of the model regarding the training instances needed. By means of using meta-models as a recurrent drift detection mechanism, the ability to share concepts representations among different data mining processes is open. That kind of exchanges could improve the accuracy of the resultant local model as such model may benefit from patterns similar to the local concept that were observed in other scenarios, but not yet locally. This would also improve the efficiency of training instances used during the classification process, as long as the exchange of models would aid in the application of already trained recurrent models, that have been previously seen by any of the collaborative devices. Which it is to say that the scope of recurrence detection and representation is broaden. In fact the detection, representation and exchange of concept drift patterns would be extremely useful for the law enforcement activities fighting against cyber crime. Being the information exchange one of the main pillars of cooperation, national units would benefit from the experience and knowledge gained by third parties. Moreover, in the specific scope of critical infrastructures protection it is crucial to count with information exchange mechanisms, both from a strategical and technical scope. The exchange of concept drift detection schemes in cyber security environments would aid in the process of preventing, detecting and effectively responding to threads in cyber space. Furthermore, as a complement of meta-models, a mechanism to assess the similarity between classification models is also needed when dealing with recurrent concepts. In this context, when reusing a previously trained model a rough comparison between concepts is usually made, applying boolean logic. The introduction of fuzzy logic comparisons between models could lead to a better efficient reuse of previously seen concepts, by applying not just equal models, but also similar ones. This work faces the aforementioned open issues by means of: the MMPRec system, that integrates a meta-model mechanism and a fuzzy similarity function; a collaborative environment to share meta-models between different devices; a recurrent drift generator that allows to test the usefulness of recurrent drift systems, as it is the case of MMPRec. Moreover, this thesis presents an experimental validation of the proposed contributions using synthetic and real datasets.
Resumo:
© The Author(s) 2014. Acknowledgements We thank the Information Services Division, Scotland, who provided the SMR01 data, and NHS Grampian, who provided the biochemistry data. We also thank the University of Aberdeen’s Data Management Team. Funding This work was supported by the Chief Scientists Office for Scotland (grant no. CZH/4/656).